<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kwansub Yun</title>
    <description>The latest articles on DEV Community by Kwansub Yun (@flamehaven01).</description>
    <link>https://dev.to/flamehaven01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508506%2Fe2f9bc29-10d2-41ec-8e77-19b8b5cfd9e9.jpg</url>
      <title>DEV Community: Kwansub Yun</title>
      <link>https://dev.to/flamehaven01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flamehaven01"/>
    <language>en</language>
    <item>
      <title>From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Fri, 08 May 2026 08:25:50 +0000</pubDate>
      <link>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</link>
      <guid>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Earlier in this series, I wrote about why bio/medical AI repositories need more than benchmarks, what I learned after auditing 10 public repositories, and why an AI auditor itself needs a memory contract.&lt;/p&gt;

&lt;p&gt;That work led to STEM-AI v1.1.2 and the MICA layer: a memory-contracted initialization step that forces the auditor to load bounded rules before scoring begins. If you have not read that part, the relevant post is here:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the broader arc, the full series is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;STEM-AI / STEM BIO-AI series&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But after that, a different engineering problem took over.&lt;/p&gt;

&lt;p&gt;The audit logic was stricter.&lt;br&gt;&lt;br&gt;
The reports were richer.&lt;br&gt;&lt;br&gt;
The reasoning was more bounded.&lt;/p&gt;

&lt;p&gt;But the developer workflow still felt too loose.&lt;/p&gt;

&lt;p&gt;So the next question was no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do I score trust?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does a bio-AI audit tool become something an engineer can actually run, gate, inspect, and integrate?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer turned out to be less about seeing more signals and more about refusing to confuse them.&lt;/p&gt;

&lt;p&gt;That is the core argument of this post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I took that seriously, STEM BIO-AI stopped looking like “one score plus some extra metadata” and started looking like a system with distinct lanes, distinct boundaries, and distinct operator workflows.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The problem was no longer scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" alt="The problem was no longer scoring" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the time I reached the 1.6.x line, the rubric was no longer the main bottleneck.&lt;/p&gt;

&lt;p&gt;The bottleneck was operational clarity.&lt;/p&gt;

&lt;p&gt;A trust audit tool is not very useful if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the normal path is one long command with too many flags&lt;/li&gt;
&lt;li&gt;CI has to reverse-engineer the result from human-readable stdout&lt;/li&gt;
&lt;li&gt;bio-specific diagnostics are mixed directly into the same surface as formal scoring&lt;/li&gt;
&lt;li&gt;regulatory relevance shows up as vague implication instead of explicit traceability&lt;/li&gt;
&lt;li&gt;advisory AI is present, but its relationship to the official score is unclear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the tool stops being hard to trust for conceptual reasons and starts being hard to trust for operational reasons.&lt;/p&gt;

&lt;p&gt;That is a different class of problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The CLI had to reflect operator intent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier CLI was functional, but too flat.&lt;/p&gt;

&lt;p&gt;You could do things like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T3 &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
stem /path/to/repo &lt;span class="nt"&gt;--advisory-response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of that worked.&lt;/p&gt;

&lt;p&gt;The issue was that it treated very different operator intents as one long option surface.&lt;/p&gt;

&lt;p&gt;In practice, these are separate workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan a repository and generate artifacts&lt;/li&gt;
&lt;li&gt;enforce a gate in CI/CD&lt;/li&gt;
&lt;li&gt;export a bounded advisory packet&lt;/li&gt;
&lt;li&gt;validate a downstream provider response&lt;/li&gt;
&lt;li&gt;cross an explicit provider-call boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I refactored the CLI around workflows instead of flag accumulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan &amp;lt;folder&amp;gt;
stem gate &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2
stem advisory validate &amp;lt;folder&amp;gt;
stem advisory packet &amp;lt;folder&amp;gt;
stem advisory call &amp;lt;folder&amp;gt;
stem advisory check-response &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--response&lt;/span&gt; FILE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The older paths still exist for compatibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem &amp;lt;folder&amp;gt;
stem audit &amp;lt;folder&amp;gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T2 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But they are no longer the conceptual center.&lt;/p&gt;

&lt;p&gt;That matters more than it sounds.&lt;/p&gt;

&lt;p&gt;Once the command names match the operator’s intent, the system becomes easier to teach, easier to remember, and easier to wire into pipelines.&lt;/p&gt;

&lt;p&gt;This is not just a DX cleanup. In a medical or bio-adjacent audit context, command ambiguity is part of the trust problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Repository trust needed four separate lanes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" alt="Repository trust needed four separate lanes" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest architectural shift.&lt;/p&gt;

&lt;p&gt;I stopped treating repository trust as one object.&lt;/p&gt;

&lt;p&gt;In practice, it needed four separate lanes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;deterministic structural scoring&lt;/li&gt;
&lt;li&gt;deterministic diagnostics&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;optional AI advisory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all of those collapse into one final confidence score, the tool becomes harder to reason about.&lt;/p&gt;

&lt;p&gt;The more regulated the domain, the more dangerous it becomes to collapse every useful signal into one score.&lt;/p&gt;

&lt;p&gt;Some evidence should change the score.&lt;br&gt;
Some evidence should only raise review priority.&lt;br&gt;
Some evidence should support traceability.&lt;br&gt;
Some evidence should be handed to a human or advisory system.&lt;/p&gt;

&lt;p&gt;The maturity of the tool is not that it sees all of them.&lt;/p&gt;

&lt;p&gt;The maturity is that it does not confuse them.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;This separation is not just conceptual. It exists in the code path.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One reasonable objection to any architecture write-up is: are these really separate lanes, or are they just different labels on the same output object?&lt;/p&gt;

&lt;p&gt;In STEM BIO-AI, the answer is visible in the execution order.&lt;/p&gt;

&lt;p&gt;The scanner computes the formal score first. In the result object, that means keys like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1&lt;/li&gt;
&lt;li&gt;Stage 2R&lt;/li&gt;
&lt;li&gt;Stage 3&lt;/li&gt;
&lt;li&gt;risk penalty&lt;/li&gt;
&lt;li&gt;score cap&lt;/li&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does it append the non-scoring layers, again as explicit result keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reasoning_model&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;optional &lt;code&gt;ai_advisory&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ordering matters.&lt;/p&gt;

&lt;p&gt;The score is not derived from the advisory lane.&lt;br&gt;
The regulatory mapping does not mutate the formal tier.&lt;br&gt;
The diagnostics lane can emit evidence without becoming a hidden score multiplier.&lt;/p&gt;

&lt;p&gt;This is also why the JSON shape ended up more layered than earlier versions. The output had to preserve the distinction the code was already enforcing.&lt;/p&gt;
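&lt;p&gt;A minimal Python sketch of that ordering, with stand-in helpers. The key names mirror the result keys in this post; the helper logic is an illustrative assumption, not the STEM BIO-AI source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the lane ordering described above.
# Key names follow this post; the helpers are illustrative stand-ins.

def compute_formal_score(repo):
    # Stand-in for Stage 1 / Stage 2R / Stage 3 scoring plus
    # risk penalty and score cap.
    return repo.get("structural_score", 0)

def tier_for(score):
    # Deliberately simplified cut; the full published scale
    # appears later in this post.
    return "T2" if score in range(55, 70) else "T0"

def build_result(repo, advisory=None):
    result = {}
    # Lane 1: the formal score is computed first, then sealed.
    result["final_score"] = compute_formal_score(repo)
    result["formal_tier"] = tier_for(result["final_score"])
    # Lanes 2-4 are appended afterwards as separate keys; none of
    # them reads or mutates final_score / formal_tier.
    result["regulatory_traceability"] = {"role": "traceability aid"}
    if advisory is not None:
        result["ai_advisory"] = advisory  # optional, advisory-only
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of the sketch is only the ordering: the advisory argument can add a key, but nothing downstream of the first two assignments can rewrite them.&lt;/p&gt;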

&lt;p&gt;That execution order is the architectural reason the next four sections exist.&lt;/p&gt;

&lt;p&gt;Once I had the lanes separated in code, each lane needed its own claim boundary, its own output semantics, and its own reason for not being collapsed into the others.&lt;/p&gt;

&lt;p&gt;Put differently, the next four sections answer four different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is allowed to change the formal tier&lt;/li&gt;
&lt;li&gt;what is useful enough to emit, but not yet mature enough to score&lt;/li&gt;
&lt;li&gt;what can support regulatory review without pretending to be compliance&lt;/li&gt;
&lt;li&gt;what can involve AI without letting AI become the scoring authority&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1. Deterministic structural scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" alt="The official baseline for triage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This remains the official score and tier.&lt;/p&gt;

&lt;p&gt;It measures the main repository-visible signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README evidence&lt;/li&gt;
&lt;li&gt;repo-local consistency&lt;/li&gt;
&lt;li&gt;code and bio responsibility&lt;/li&gt;
&lt;li&gt;dependency hygiene&lt;/li&gt;
&lt;li&gt;changelog and provenance surfaces&lt;/li&gt;
&lt;li&gt;code-integrity patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lane is local, deterministic, and machine-checkable.&lt;/p&gt;

&lt;p&gt;That is the part that can legitimately drive a formal triage tier.&lt;/p&gt;

&lt;p&gt;I am not claiming this is the only possible architecture. A different system could have folded diagnostics or replication more aggressively into one unified score.&lt;/p&gt;

&lt;p&gt;I chose not to, because the narrower score proved easier to defend. A smaller claim with cleaner boundaries was more valuable here than a broader score with ambiguous semantics.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;2. Deterministic diagnostics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where the deterministic diagnostics spec became important.&lt;/p&gt;

&lt;p&gt;I needed a place for findings that are real, useful, and inspectable, but should not silently perturb the main score until they are calibrated.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;docs/DETERMINISTIC_DIAGNOSTICS.md&lt;/code&gt; defines.&lt;/p&gt;

&lt;p&gt;It separates the diagnostic problem into two lanes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lane A: deterministic local diagnostics&lt;/li&gt;
&lt;li&gt;Lane B: optional AI-assisted semantic review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is central.&lt;/p&gt;

&lt;p&gt;The deterministic lane is authoritative for hard findings.&lt;br&gt;
The AI lane is advisory only.&lt;/p&gt;

&lt;p&gt;The local diagnostic lane currently focuses on evidence-bearing bio-specific signals such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed or suspicious SMILES-like outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;li&gt;silent mock or simulated-data fallbacks&lt;/li&gt;
&lt;li&gt;risky subprocess construction around bio tools&lt;/li&gt;
&lt;li&gt;traceability manifest surfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point was not to create a “bio slop detector” with a catchy label.&lt;/p&gt;

&lt;p&gt;The point was to create a local evidence lane that could say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;here is the file&lt;/li&gt;
&lt;li&gt;here is the line&lt;/li&gt;
&lt;li&gt;here is the snippet&lt;/li&gt;
&lt;li&gt;here is the bounded interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is much more useful than a vague semantic warning.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why diagnostics stayed evidence-only
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" alt="Retaining evidence without inflating the score" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was one of the harder engineering decisions.&lt;/p&gt;

&lt;p&gt;It would have been easy to push every new bio-specific detector directly into the final score.&lt;/p&gt;

&lt;p&gt;I did not do that.&lt;/p&gt;

&lt;p&gt;The deterministic diagnostics spec is explicit that many of these findings begin as evidence-only. In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;findings are emitted as line-level records in the result object’s &lt;code&gt;evidence_ledger&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings appear in Markdown and &lt;code&gt;--explain&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings do not change &lt;code&gt;final_score&lt;/code&gt; or &lt;code&gt;formal_tier&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right default.&lt;/p&gt;
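&lt;p&gt;For illustration, here is a hypothetical shape for one such evidence-only record. Every field name is an assumption, following the file / line / snippet / bounded-interpretation pattern above rather than the actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical evidence-only record; field names are illustrative,
# not the actual STEM BIO-AI schema.
finding = {
    "finding_id": "BIO-SMILES-001",   # illustrative ID
    "file": "models/generator.py",
    "line": 142,
    "snippet": "return 'CCCC'  # placeholder",
    "interpretation": "repeated trivial SMILES output; evidence-only",
    "affects_score": False,           # never mutates final_score
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

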

&lt;p&gt;For example, the SMILES lane can be very useful for detecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed surface strings&lt;/li&gt;
&lt;li&gt;low-entropy placeholders&lt;/li&gt;
&lt;li&gt;repeated trivial outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;medicinal usefulness&lt;/li&gt;
&lt;li&gt;synthetic feasibility&lt;/li&gt;
&lt;li&gt;binding plausibility&lt;/li&gt;
&lt;li&gt;biological efficacy&lt;/li&gt;
&lt;li&gt;full chemical validity in every edge case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That boundary is important.&lt;/p&gt;

&lt;p&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;
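&lt;p&gt;To make that boundary concrete, here is the flavor of check such a lane can legitimately make. This is a deliberately simplified sketch, not the shipped detector; the thresholds and names are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified, hypothetical surface checks in the spirit of the
# SMILES lane: they flag malformed-looking strings without making
# any claim about chemistry.

def smiles_surface_flags(s):
    flags = []
    # Unbalanced branch characters are a cheap malformed signal.
    if s.count("(") != s.count(")"):
        flags.append("unbalanced_branches")
    # Very low character diversity on a non-trivial string suggests
    # a placeholder, not a molecule.
    if len(set(s)) in (1, 2) and len(s) not in (0, 1, 2, 3):
        flags.append("low_entropy_placeholder")
    return flags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both checks stay on the allowed side of the boundary: they can say a string looks malformed or placeholder-like, and nothing more.&lt;/p&gt;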

&lt;p&gt;Just as importantly, this is not meant to be a permanent holding area for every detector. The diagnostics spec is explicit that score impact should only happen after commit-pinned benchmark evidence, explicit false-positive review, and reproducible calibration. In other words, evidence-only is the temporary safe default until a detector has earned score authority.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;3. Regulatory traceability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" alt="Traceability is not a permission slip" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second document that became central was &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This solved a different problem.&lt;/p&gt;

&lt;p&gt;Once you audit clinical-adjacent repositories, people naturally ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this align with EU AI Act themes?&lt;/li&gt;
&lt;li&gt;does this help with FDA-oriented review?&lt;/li&gt;
&lt;li&gt;is there anything relevant to IMDRF or SaMD evidence families?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wrong answer would be to turn those questions into a fake compliance score.&lt;/p&gt;

&lt;p&gt;So I did the opposite.&lt;/p&gt;

&lt;p&gt;The regulatory layer is explicitly framed as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a traceability aid, not a compliance verdict&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That document maps observed evidence classes to requirement families with bounded confidence labels like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strong&lt;/li&gt;
&lt;li&gt;moderate&lt;/li&gt;
&lt;li&gt;weak-moderate&lt;/li&gt;
&lt;li&gt;weak&lt;/li&gt;
&lt;li&gt;not assessed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it makes an important distinction:&lt;/p&gt;

&lt;p&gt;the confidence applies to the mapping relationship, not to legal acceptability.&lt;/p&gt;

&lt;p&gt;Those confidence labels are not model outputs and they are not inferred at runtime. They are fixed, rule-level mapping judgments attached to evidence classes in the mapping document itself. For example, changelog / checksum / config-manifest style evidence is treated as a moderate traceability signal for Article 12-style review, while human-oversight interface signals stay weak because interface presence is not the same thing as oversight procedure.&lt;/p&gt;
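&lt;p&gt;A hypothetical excerpt of what such fixed mapping judgments could look like in code. The evidence-class names here are assumptions; in the real tool the judgments live in &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative excerpt only; evidence-class names are assumptions.
REG_MAPPING = {
    "changelog_checksum_manifest": {
        "family": "record-keeping / traceability (Article 12-style)",
        "confidence": "moderate",
    },
    "human_oversight_interface": {
        "family": "human-oversight interface review",
        # Interface presence is not the same as oversight procedure.
        "confidence": "weak",
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

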

&lt;p&gt;That means the tool can say things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioned manifests and changelogs may support record-keeping / traceability review&lt;/li&gt;
&lt;li&gt;intended-use and disclaimer sections may support transparency scaffolding review&lt;/li&gt;
&lt;li&gt;override interfaces may support human-oversight interface review&lt;/li&gt;
&lt;li&gt;subgroup measurement language may support weak evidence of data-governance intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without claiming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;legal compliance&lt;/li&gt;
&lt;li&gt;regulatory clearance&lt;/li&gt;
&lt;li&gt;clinical certification&lt;/li&gt;
&lt;li&gt;deployer conformance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a regulated domain, traceability is useful only when it does not pretend to be permission.&lt;/p&gt;
&lt;h3&gt;
  
  
  A concrete example: why Article 12 is traceability, not compliance
&lt;/h3&gt;

&lt;p&gt;The best example here is EU AI Act Article 12-style traceability.&lt;/p&gt;

&lt;p&gt;The regulatory mapping layer treats signals like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;checksum manifests&lt;/li&gt;
&lt;li&gt;versioned config surfaces&lt;/li&gt;
&lt;li&gt;audit-log schema fragments&lt;/li&gt;
&lt;li&gt;decision-event or override-event schema tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as evidence that a repository may have traceability scaffolding.&lt;/p&gt;

&lt;p&gt;That is useful.&lt;/p&gt;

&lt;p&gt;It is also bounded.&lt;/p&gt;

&lt;p&gt;The mapping document is explicit that changelog presence is not the same thing as deploy-time event logging, and that current scope does not establish runtime log completeness.&lt;/p&gt;

&lt;p&gt;So the output can legitimately say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is structural evidence relevant to traceability review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;while refusing to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this system satisfies traceability obligations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of distinction I wanted this lane to enforce.&lt;/p&gt;

&lt;p&gt;What this buys in practice is not a compliance shortcut, but a faster review question. If a repository exposes none of the scaffolding signals in this lane — no change history, no artifact hashes, no versioned manifests, no event-schema surfaces — then there is very little reason to treat it as traceability-ready for deeper institutional review. If those signals do exist, the next step is still expert inspection, but the scanner has at least opened the right folder and pointed at the right files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why regulatory mapping stayed subordinate to evidence
&lt;/h3&gt;

&lt;p&gt;This was non-negotiable.&lt;/p&gt;

&lt;p&gt;Regulatory relevance had to remain downstream from evidence, not a score multiplier pretending to be law.&lt;/p&gt;

&lt;p&gt;That is why the output shape separates things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;from the actual score computation.&lt;/p&gt;

&lt;p&gt;And it is not just decorative structure.&lt;/p&gt;

&lt;p&gt;The regulatory basis object is registry-driven. It can mark &lt;code&gt;review_required&lt;/code&gt; when the basis registry is stale or required source families are missing. That is a traceability control on the mapping layer itself, not an input into the scoring formula.&lt;/p&gt;
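&lt;p&gt;A hedged sketch of that control. The function and field names, and the one-year threshold, are assumptions; the point is that staleness flips &lt;code&gt;review_required&lt;/code&gt; without touching any score key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import datetime

# Hypothetical sketch of a registry-driven review_required control;
# names and the 365-day threshold are illustrative assumptions.
def regulatory_basis(registry, today=None):
    today = today or datetime.date.today()
    basis = {"sources": registry["sources"], "review_required": False}
    age = today - registry["last_reviewed"]
    # A stale registry (over a year) or missing source families both
    # force review, without feeding into the scoring formula.
    if age.days not in range(366) or not registry["sources"]:
        basis["review_required"] = True
    return basis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

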

&lt;p&gt;This is also why the regulatory note belongs in a muted traceability panel, not next to the main score.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is useful.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is still not compliance.&lt;/p&gt;

&lt;p&gt;The distinction has to remain visible in both the code and the artifacts.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;4. Optional AI advisory&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" alt="Enforcing a bounded intelligence sandbox" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fourth lane is the advisory layer.&lt;/p&gt;

&lt;p&gt;This one exists for bounded model-assisted review, but it does not get to rewrite the official outcome.&lt;/p&gt;

&lt;p&gt;That means workflows like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo
stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can exist without creating ambiguity about who owns the formal result.&lt;/p&gt;

&lt;p&gt;The advisory layer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export a provider-neutral packet&lt;/li&gt;
&lt;li&gt;validate downstream response structure&lt;/li&gt;
&lt;li&gt;enforce finding-ID citation rules&lt;/li&gt;
&lt;li&gt;reject prohibited claims&lt;/li&gt;
&lt;li&gt;surface runtime and secret boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it cannot do is silently override:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;score.final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;score.formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How that rule is actually enforced
&lt;/h3&gt;

&lt;p&gt;This is not just policy language in the README.&lt;/p&gt;

&lt;p&gt;The advisory validator explicitly checks for score-override attempts. If a response includes fields like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;or sets &lt;code&gt;final_score_override&lt;/code&gt;, the response is marked invalid with &lt;code&gt;final_score_override_requested&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The packet contract also exports the rule in plain language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not modify or override &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, &lt;code&gt;replication_score&lt;/code&gt;, or &lt;code&gt;replication_tier&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And provider responses must cite exact values from &lt;code&gt;allowed_finding_ids&lt;/code&gt;; citation strings are not repaired or loosely matched later.&lt;/p&gt;
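&lt;p&gt;Put together, the two checks could be sketched like this. The key and flag names follow the post; the validator body itself is an illustrative assumption, not the shipped code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the two advisory boundaries described
# above; only the key and flag names come from the post.

FORBIDDEN_KEYS = ("final_score", "formal_tier",
                  "replication_score", "replication_tier")

def validate_advisory_response(response, allowed_finding_ids):
    errors = []
    # 1. Any attempt to set scoring keys invalidates the response.
    if (any(k in response for k in FORBIDDEN_KEYS)
            or response.get("final_score_override")):
        errors.append("final_score_override_requested")
    # 2. Citations must be exact values from allowed_finding_ids;
    #    nothing is repaired or fuzzily matched.
    for cited in response.get("cited_findings", []):
        if cited not in allowed_finding_ids:
            errors.append("unknown_finding_id:" + cited)
    return errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

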

&lt;p&gt;So the advisory lane is bounded in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it has no authority to change the deterministic result&lt;/li&gt;
&lt;li&gt;it cannot cite evidence outside the bounded packet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of mechanism I mean when I say “better boundaries.” If the rule cannot be checked, it is not really part of the architecture yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What operational use looks like now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" alt="One execution driving distinct operator surfaces" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once these lanes were separated, the CLI became much easier to reason about.&lt;/p&gt;

&lt;p&gt;Local engineering review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI/CD gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem gate /path/to/repo &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2 &lt;span class="nt"&gt;--summary&lt;/span&gt; off &lt;span class="nt"&gt;--output&lt;/span&gt; results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Offline advisory packet generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo &lt;span class="nt"&gt;--output&lt;/span&gt; advisory_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Downstream provider response validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important point is not just that these commands exist.&lt;/p&gt;

&lt;p&gt;It is that each one represents a distinct trust boundary.&lt;/p&gt;

&lt;p&gt;That made the project feel more like engineering infrastructure and less like a scoring demo.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A real v1.6.2 packet&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make that less abstract, I re-ran STEM BIO-AI v1.6.2 against a local clone of &lt;a href="https://github.com/ClawBio/ClawBio" rel="noopener noreferrer"&gt;ClawBio&lt;/a&gt;, which describes itself as a local-first, privacy-focused, reproducible bioinformatics-native AI skill library.&lt;/p&gt;

&lt;p&gt;The command was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; stem_ai.cli scan /path/to/ClawBio &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" alt="ClawBio_ClawBio_detailed_5p-1" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" alt="ClawBio_ClawBio_detailed_5p-2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On my machine, that run took about &lt;strong&gt;9.4 seconds&lt;/strong&gt; and emitted the usual CLI output set: a machine-readable JSON result, a Markdown report, a 5-page PDF packet, and a line-level explain trace.&lt;/p&gt;

&lt;p&gt;Before the numbers, the important context is that STEM BIO-AI uses a published triage scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;T0&lt;/code&gt; = 0-39&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T1&lt;/code&gt; = 40-54&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T2&lt;/code&gt; = 55-69&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T3&lt;/code&gt; = 70-84&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T4&lt;/code&gt; = 85-100&lt;/li&gt;
&lt;/ul&gt;
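&lt;p&gt;As a minimal sketch (not the shipped implementation), that scale is just a cutoff table; &lt;code&gt;tier_for&lt;/code&gt; and the cutoff list here are hypothetical names:&lt;/p&gt;

```python
import bisect

# Published triage scale from above: T0 = 0-39, T1 = 40-54,
# T2 = 55-69, T3 = 70-84, T4 = 85-100.
# CUTOFFS holds the lower bound of T1..T4.
TIERS = ["T0", "T1", "T2", "T3", "T4"]
CUTOFFS = [40, 55, 70, 85]

def tier_for(score):
    # bisect_right counts how many cutoffs the score has passed,
    # which is exactly the tier index.
    return TIERS[bisect.bisect_right(CUTOFFS, score)]

print(tier_for(67))  # prints T2 -- where the ClawBio run below lands
```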

&lt;p&gt;Stage 4 replication is reported separately as its own lane, where &lt;code&gt;R2&lt;/code&gt; means some reproducibility scaffolding is present, but not yet enough to call the repository replication-strong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Governance note:&lt;br&gt;
This is not a “bad repository” scoreboard, a clinical safety verdict, or a moral ranking. It is a deterministic evidence-surface pre-screen intended to support review, not replace it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the result was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;67 / 100&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T2 Caution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication lane: 55 / 100 (R2)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clinical adjacency: CA-DIRECT&lt;/strong&gt; (the repository surface makes direct healthcare-facing claims, even though it also carries an explicit non-clinical boundary)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code integrity warnings: C2 dependency pinning, C4 exception handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the workflow shift I wanted the tool to support.&lt;/p&gt;

&lt;p&gt;The same deterministic scan is rendered into multiple operator surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON for automation&lt;/li&gt;
&lt;li&gt;Markdown for review&lt;/li&gt;
&lt;li&gt;PDF for human-facing packet inspection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--explain&lt;/code&gt; for file / line / snippet proof tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That output shape is only possible because the result object already separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formal score and tier&lt;/li&gt;
&lt;li&gt;replication lane&lt;/li&gt;
&lt;li&gt;diagnostics lane&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;advisory boundary state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the PDF is not a separate product. It is a view over the same bounded audit object.&lt;/p&gt;
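&lt;p&gt;A rough sketch of that "one object, many views" shape, with illustrative field names rather than the exact v1.6.2 schema:&lt;/p&gt;

```python
import json

# Hypothetical shape of the bounded audit object; the field names
# are illustrative, not the exact v1.6.2 schema.
result = {
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "replication_lane": {"score": 55, "grade": "R2"},
    "clinical_adjacency": "CA-DIRECT",
}

def to_json(result):
    # Machine surface: the object itself, serialized.
    return json.dumps(result, indent=2)

def to_markdown(result):
    # Human surface: a view over the same fields, not a second product.
    lane = result["replication_lane"]
    return "\n".join([
        "## Audit result",
        f"- Score: {result['final_score']} / 100 ({result['formal_tier']})",
        f"- Replication lane: {lane['score']} / 100 ({lane['grade']})",
        f"- Clinical adjacency: {result['clinical_adjacency']}",
    ])
```

&lt;p&gt;The PDF and explain trace would be further renderers over the same dict, which is why the formats cannot drift apart.&lt;/p&gt;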

&lt;p&gt;Two details from this run are worth calling out.&lt;/p&gt;

&lt;p&gt;First, the scanner did &lt;strong&gt;not&lt;/strong&gt; manufacture chemistry findings just because ClawBio is bio-adjacent. The deterministic diagnostics lane reported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SMILES Surface Integrity: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES RDKit Validation: not_applicable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES Parser Guard: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the behavior I want. If a detector has no evidence, it should stay silent instead of inflating the report with domain-flavored noise. This is what the earlier thesis looks like when it hits real output: a detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;
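&lt;p&gt;In sketch form, an evidence-first detector looks like this; &lt;code&gt;detect_smiles_surface&lt;/code&gt;, the regex, and the status strings are hypothetical, not the shipped detector:&lt;/p&gt;

```python
import re

# Illustrative "stay silent" contract: with no evidence, the detector
# reports not_detected instead of emitting domain-flavored noise.
SMILES_HINT = re.compile(r"\b(rdkit|Chem\.MolFromSmiles|SMILES)\b")

def detect_smiles_surface(files):
    # files: mapping of path to file text
    hits = [path for path, text in files.items() if SMILES_HINT.search(text)]
    if not hits:
        # No evidence: say so explicitly, with an empty evidence list.
        return {"status": "not_detected", "evidence": []}
    return {"status": "detected", "evidence": hits}

print(detect_smiles_surface({"main.py": "print('hello')"})["status"])  # not_detected
```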

&lt;p&gt;Second, the score is strict about observable repository conventions. ClawBio uses &lt;code&gt;ClawBio_README_Repo.md&lt;/code&gt; rather than a root &lt;code&gt;README.md&lt;/code&gt;, so the scan records &lt;code&gt;S1_missing_readme: -20&lt;/code&gt;. A human reviewer might decide that this is acceptable contextually. The scanner does not make that leap for them. It only records what the repository exposes through the surfaces it knows how to measure.&lt;/p&gt;

&lt;p&gt;That distinction matters. A &lt;code&gt;T2 Caution&lt;/code&gt; result here does not mean “ClawBio is unsafe.” It means the current repository surface still raises review-relevant signals under the published deterministic rules, including dependency-pinning warnings, exception-handling warnings in a clinical-adjacent surface, and a README convention check that is stricter than most human reviewers would be.&lt;/p&gt;

&lt;p&gt;And that is exactly why the next section matters: once the workflow is concrete, the remaining question is not whether the tool can produce an answer, but where its current boundaries still need to stay visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What still has to stay bounded&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system is better than it was, but there are still obvious next steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The public surface is broad
&lt;/h3&gt;

&lt;p&gt;There is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring&lt;/li&gt;
&lt;li&gt;diagnostics&lt;/li&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;advisory packeting&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;JSON / Markdown / PDF / explain outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful, but it increases onboarding cost.&lt;/p&gt;

&lt;p&gt;The CLI itself is clearer now, but the broader public surface has to stay equally disciplined.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The deterministic diagnostics lane is still missing a published calibration threshold
&lt;/h3&gt;

&lt;p&gt;The diagnostics lane is evidence-first by design, but one practical gap remains: the public release does not yet ship a benchmark-backed threshold document saying exactly when a detector is mature enough to graduate from evidence-only into score-bearing territory.&lt;/p&gt;

&lt;p&gt;Right now the rule is conceptually clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;commit-pinned fixtures&lt;/li&gt;
&lt;li&gt;reproducible detector output&lt;/li&gt;
&lt;li&gt;explicit false-positive review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the public decision boundary is still partly narrative. Until that calibration surface is published in a more operational form, keeping diagnostics evidence-only is the safer choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The regulatory confidence labels are rule-authored, not empirically validated
&lt;/h3&gt;

&lt;p&gt;The mapping labels like &lt;code&gt;strong&lt;/code&gt;, &lt;code&gt;moderate&lt;/code&gt;, and &lt;code&gt;weak-moderate&lt;/code&gt; are currently fixed rule-level judgments in the mapping document. They are not runtime model outputs, but they are also not yet backed by inter-rater reliability studies or a published reviewer-agreement benchmark.&lt;/p&gt;

&lt;p&gt;That means they are useful as bounded structural mapping language, but they should not be treated as empirical proof that multiple auditors would converge on exactly the same label distribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Earlier context&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;Medical AI Repositories Need More Than Benchmarks. We Built STEM-AI to Audit Trust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Try it yourself&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;STEM BIO-AI is Apache 2.0 and fully open source.&lt;/p&gt;

&lt;p&gt;If you want to know whether a bio/medical AI repository is actually exposing reviewable evidence, or whether your own repository is weaker than you think, run it yourself.&lt;/p&gt;

&lt;p&gt;That is the real test.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;License: Apache 2.0&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier STEM-AI posts were about why repository trust deserves its own audit layer.&lt;/p&gt;

&lt;p&gt;This phase was about something more practical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;what does that audit layer have to look like if an engineer is actually going to run it, inspect it, and put it in a pipeline?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, the answer was simple:&lt;/p&gt;

&lt;p&gt;Separate the workflows.&lt;br&gt;
Separate the lanes.&lt;br&gt;
Keep diagnostics evidence-first.&lt;br&gt;
Keep regulatory mapping subordinate to evidence.&lt;br&gt;
Keep advisory AI bounded.&lt;/p&gt;

&lt;p&gt;Optimize for inspectability, not just score production.&lt;/p&gt;

&lt;p&gt;That is what changed the project.&lt;/p&gt;

&lt;p&gt;Not bigger claims.&lt;/p&gt;

&lt;p&gt;Better boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" alt="Final thought" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>opensource</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:51:45 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</link>
      <guid>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Previous article:&lt;br&gt;
&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;&lt;strong&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.&lt;/p&gt;

&lt;p&gt;The more useful lesson was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Text-only review is too weak for bio/medical AI. You have to inspect the code path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That worked.&lt;/p&gt;

&lt;p&gt;But it exposed the next problem.&lt;/p&gt;

&lt;p&gt;If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?&lt;/p&gt;

&lt;p&gt;LLMs drift. &lt;br&gt;
One session can enforce a clinical boundary strictly. &lt;br&gt;
Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 is my answer to that problem.&lt;/p&gt;

&lt;p&gt;It does not try to make the LLM deterministic by writing a longer prompt.&lt;/p&gt;

&lt;p&gt;It binds the audit to a memory contract.&lt;/p&gt;


&lt;h2&gt;
  
  
  What v1.1.2 adds
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" alt="standard audit vs Bio/Medical AI audit" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 introduces &lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA: Memory-Injected Contract Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The v1.1.2 layer includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memory/mica.yaml&lt;/code&gt; -- composition contract&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai.mica.v1.1.2.json&lt;/code&gt; -- machine-checkable memory archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-playbook.v1.1.2.md&lt;/code&gt; -- session playbook and drift guard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-lessons.v1.1.2.md&lt;/code&gt; -- historical failure-mode archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spec/STEM-AI_v1.1.2_CORE.md&lt;/code&gt; -- canonical audit spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract pins 18 invariants.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage order is fixed: README intent, cross-platform evidence, code/bio evidence.&lt;/li&gt;
&lt;li&gt;Stage weights are fixed.&lt;/li&gt;
&lt;li&gt;Tier boundaries are fixed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; cannot be bypassed.&lt;/li&gt;
&lt;li&gt;Stage 2 may use external evidence or Stage 2R repo-local consistency in LOCAL_ANALYSIS mode.&lt;/li&gt;
&lt;li&gt;Governance overlay cannot raise the formal base tier.&lt;/li&gt;
&lt;li&gt;C1-C4 code-integrity checks only run in LOCAL_ANALYSIS mode.&lt;/li&gt;
&lt;li&gt;Mandatory clinical-use disclaimers cannot be omitted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a claim that the LLM becomes perfectly deterministic.&lt;/p&gt;

&lt;p&gt;It is a narrower claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The auditor is forced to operate inside a contract whose scoring rules, hard floors, and evidence requirements are inspectable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the useful layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  What "loading the contract" means
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" alt="Forcing the auditor to operate inside a machine-checkable memory contract" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA&lt;/a&gt;&lt;/strong&gt; is not hidden model memory.&lt;/p&gt;

&lt;p&gt;It is also not a claim that the model provider changed the LLM.&lt;/p&gt;

&lt;p&gt;In v1.1.2, "loading the contract" means the audit session starts by reading a fixed set of repository files before it is allowed to score the target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory/mica.yaml
memory/stem-ai.mica.v1.1.2.json
memory/stem-ai-playbook.v1.1.2.md
memory/stem-ai-lessons.v1.1.2.md
spec/STEM-AI_v1.1.2_CORE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" alt="Pinning the audit rules mathematically" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The auditor then performs a pre-execution contract test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirm the canonical spec exists&lt;/li&gt;
&lt;li&gt;confirm the memory archive exists&lt;/li&gt;
&lt;li&gt;confirm the invariant count is 18&lt;/li&gt;
&lt;li&gt;confirm the fixed tier boundaries are present&lt;/li&gt;
&lt;li&gt;confirm the Stage 2 / Stage 2R lane rule is present&lt;/li&gt;
&lt;li&gt;confirm Stage 3G cannot raise the formal tier&lt;/li&gt;
&lt;li&gt;confirm C1-C4 mode gating is active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does the audit proceed.&lt;/p&gt;

&lt;p&gt;This does not make the LLM mathematically deterministic.&lt;/p&gt;

&lt;p&gt;It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.&lt;/p&gt;
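&lt;p&gt;A minimal sketch of that pre-flight shape, using the contract paths listed above; &lt;code&gt;preflight&lt;/code&gt; is a hypothetical name and the check is simplified to two of the listed conditions:&lt;/p&gt;

```python
from pathlib import Path

# The contract files v1.1.2 loads before scoring (paths from this post).
CONTRACT_FILES = [
    "memory/mica.yaml",
    "memory/stem-ai.mica.v1.1.2.json",
    "memory/stem-ai-playbook.v1.1.2.md",
    "memory/stem-ai-lessons.v1.1.2.md",
    "spec/STEM-AI_v1.1.2_CORE.md",
]

def preflight(repo_root, invariant_count):
    # Hypothetical pre-execution contract test: every contract file must
    # exist and the pinned invariant count must match, or the session
    # stops here -- before any scoring happens.
    missing = [p for p in CONTRACT_FILES if not (Path(repo_root) / p).is_file()]
    if missing or invariant_count != 18:
        raise SystemExit(f"contract not loaded, refusing to score: {missing}")
    return True
```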

&lt;p&gt;That is the difference between &lt;strong&gt;"please be consistent"&lt;/strong&gt; and &lt;strong&gt;"execute this versioned contract."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit workflow
&lt;/h2&gt;

&lt;p&gt;STEM-AI v1.1.2 runs as a structured audit workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" alt="STEM-AI v1.1.2 workflow" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.&lt;/p&gt;

&lt;p&gt;It can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;package metadata&lt;/li&gt;
&lt;li&gt;workflow files&lt;/li&gt;
&lt;li&gt;test definitions&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;source-code paths&lt;/li&gt;
&lt;li&gt;deprecated or dead-code paths&lt;/li&gt;
&lt;li&gt;exception handling&lt;/li&gt;
&lt;li&gt;credential patterns&lt;/li&gt;
&lt;li&gt;provenance and hash-checking logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is intentionally split into two files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report.md                  # human-readable audit judgment
experiment_results.json    # machine-readable evidence and score object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" alt="Separating subjective reasoning from verifiable mathematics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That split matters.&lt;/p&gt;

&lt;p&gt;The report explains the reasoning.&lt;/p&gt;

&lt;p&gt;The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.&lt;/p&gt;




&lt;h2&gt;
  
  
  A real target audit, not a synthetic example
&lt;/h2&gt;

&lt;p&gt;For this v1.1.2 demonstration, I used a real public repository:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/artic-network/fieldbioinformatics" rel="noopener noreferrer"&gt;artic-network/fieldbioinformatics&lt;br&gt;
&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The target is not the protagonist of this post.&lt;/p&gt;

&lt;p&gt;It is only the specimen used to show the audit workflow against a real bioinformatics codebase.&lt;/p&gt;

&lt;p&gt;The local audit produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audits/fieldbioinformatics_v1_1_2/report.md
audits/fieldbioinformatics_v1_1_2/experiment_results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The target snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"master"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8008b4c97c2193a82308ff6f0be507b1d9306e36"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;114&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the important part: the audit did not ask, "Does this README sound trustworthy?"&lt;/p&gt;

&lt;p&gt;It asked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do README claims match actual package metadata and entry points?&lt;/li&gt;
&lt;li&gt;Are there real CI and domain-specific tests?&lt;/li&gt;
&lt;li&gt;Are dependencies reproducible enough?&lt;/li&gt;
&lt;li&gt;Are there credential leaks?&lt;/li&gt;
&lt;li&gt;Are there deprecated patient-adjacent paths?&lt;/li&gt;
&lt;li&gt;Do clinical-adjacent output paths fail closed?&lt;/li&gt;
&lt;li&gt;Does the repository include governance evidence, or only governance absence?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where STEM-AI is useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  The score object
&lt;/h2&gt;

&lt;p&gt;The machine-readable result records the score like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_1_readme_intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_cross_platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_repo_local_consistency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_lane"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STAGE_2R_REPO_LOCAL_CONSISTENCY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"External Stage 2 was not collected; LOCAL_ANALYSIS used Stage 2R in the fixed 0.20 Stage 2 slot."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_3_code_bio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_penalty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"formal_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"T2 Caution"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;External Stage 2 is explicitly represented as &lt;code&gt;null&lt;/code&gt; for this local-only audit.&lt;/p&gt;

&lt;p&gt;That does not mean cross-platform consistency is unimportant.&lt;/p&gt;

&lt;p&gt;It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.&lt;/p&gt;

&lt;p&gt;Stage 2R asks whether the repository's own surfaces agree with each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README vs package metadata and CLI entry points&lt;/li&gt;
&lt;li&gt;README vs docs, tutorials, and troubleshooting&lt;/li&gt;
&lt;li&gt;README test claims vs CI workflow and test definitions&lt;/li&gt;
&lt;li&gt;clinical-adjacent outputs vs local intended-use boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract defines the fixed-weight calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - Risk Penalty
      = (65 x 0.40) + (75 x 0.20) + (55 x 0.40) - 0
      = 63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
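&lt;p&gt;Because the score object and the weights both live in the JSON, a second reviewer can recheck that arithmetic mechanically, without trusting the prose report. A sketch, using the fields shown above (not the shipped verifier):&lt;/p&gt;

```python
import json

# The score object from experiment_results.json, trimmed to the
# fields the recheck needs.
raw = """{
  "stage_1_readme_intent": 65,
  "stage_2_repo_local_consistency": 75,
  "stage_3_code_bio": 55,
  "weights": {"stage_1": 0.4, "stage_2": 0.2, "stage_3": 0.4},
  "risk_penalty": 0,
  "final_score": 63
}"""

obj = json.loads(raw)
w = obj["weights"]

# Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - penalty
recomputed = round(
    obj["stage_1_readme_intent"] * w["stage_1"]
    + obj["stage_2_repo_local_consistency"] * w["stage_2"]
    + obj["stage_3_code_bio"] * w["stage_3"]
    - obj["risk_penalty"]
)
assert recomputed == obj["final_score"]  # 63 either way
```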



&lt;p&gt;The final tier is therefore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T2 Caution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not because the prose sounded balanced.&lt;/p&gt;

&lt;p&gt;Because the contract math forces that result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the T0 hard floor did not trigger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" alt="Why the T0 hard floor did not trigger" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.&lt;/p&gt;

&lt;p&gt;In simplified form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If a repository is CA-DIRECT
and it has no substantive code implementation,
then final tier = T0 regardless of score math.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.&lt;/p&gt;
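&lt;p&gt;The override logic, in sketch form; &lt;code&gt;apply_hard_floor&lt;/code&gt; is a hypothetical name for the simplified rule above:&lt;/p&gt;

```python
# Simplified sketch of T0_HARD_FLOOR: the floor overrides the score
# math, it never bends to it.
def apply_hard_floor(tier, ca_severity, has_substantive_code):
    if ca_severity == "CA-DIRECT" and not has_substantive_code:
        return "T0"  # rejected regardless of how the weighted score landed
    return tier

# The fieldbioinformatics audit: CA-INDIRECT with real implementation,
# so the computed tier stands.
print(apply_hard_floor("T2 Caution", "CA-INDIRECT", True))  # T2 Caution
```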

&lt;p&gt;The audited repository did not trigger that floor because STEM-AI classified it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clinical_adjacent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ca_severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"t0_hard_floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface did not make direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.&lt;/p&gt;

&lt;p&gt;So the result is not T0.&lt;/p&gt;

&lt;p&gt;But it is also not high-trust.&lt;/p&gt;

&lt;p&gt;The bounded result is T2 Caution.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" alt="Stem-AI Audit v1.1.2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code-integrity findings
&lt;/h2&gt;

&lt;p&gt;The same JSON records C1-C4 LOCAL_ANALYSIS checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C1_hardcoded_credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C2_dependency_pinning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C3_dead_or_deprecated_patient_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C4_exception_handling_clinical_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the difference between a general review and a code-path audit.&lt;/p&gt;

&lt;p&gt;A text review can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project appears technically mature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A code-path audit can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Credential patterns were checked. Dependency pinning is weak. Deprecated patient-adjacent metadata exists. One clinical-adjacent filtering path does not fail closed on missing depth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a more useful governance object.&lt;/p&gt;

&lt;p&gt;It is not a certificate.&lt;/p&gt;

&lt;p&gt;It is a map of what a reviewer should trust, distrust, or inspect next.&lt;/p&gt;




&lt;h2&gt;
  
  
  A small Python verifier
&lt;/h2&gt;

&lt;p&gt;Here is a small, dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It needs neither the target's private code nor patient data; it only checks the machine-readable audit result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;


&lt;span class="n"&gt;RESULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audits/fieldbioinformatics_v1_1_2/experiment_results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;█&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;░&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RESULT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1_readme_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2_repo_local_consistency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3_code_bio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;risk_penalty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;risk_penalty&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 1  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 2R &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 3  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_integrity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1   65/100  █████████████░░░░░░░
Stage 2R  75/100  ███████████████░░░░░
Stage 3   55/100  ███████████░░░░░░░░░
Final     63/100  █████████████░░░░░░░
Tier      T2 Caution
C1_hardcoded_credentials: PASS
C2_dependency_pinning: WARN
C3_dead_or_deprecated_patient_adjacent_paths: WARN
C4_exception_handling_clinical_adjacent_paths: WARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Bio/medical AI governance is full of language that sounds safe but is hard to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"research use only"&lt;/li&gt;
&lt;li&gt;"not medical advice"&lt;/li&gt;
&lt;li&gt;"validated pipeline"&lt;/li&gt;
&lt;li&gt;"clinical-grade"&lt;/li&gt;
&lt;li&gt;"responsible AI"&lt;/li&gt;
&lt;li&gt;"human-in-the-loop"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those phrases are not enough.&lt;/p&gt;

&lt;p&gt;STEM-AI asks for observable structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source-code reality&lt;/li&gt;
&lt;li&gt;test reality&lt;/li&gt;
&lt;li&gt;CI reality&lt;/li&gt;
&lt;li&gt;dependency reality&lt;/li&gt;
&lt;li&gt;clinical boundary reality&lt;/li&gt;
&lt;li&gt;governance artifact reality&lt;/li&gt;
&lt;li&gt;code-integrity reality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.1.2 adds another layer:&lt;/p&gt;

&lt;p&gt;auditor reality.&lt;/p&gt;

&lt;p&gt;The AI auditor itself has to load a memory contract before it scores.&lt;/p&gt;

&lt;p&gt;That is what MICA is for.&lt;/p&gt;

&lt;p&gt;The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.&lt;/p&gt;

&lt;p&gt;Not hype.&lt;/p&gt;

&lt;p&gt;Not rejection by default.&lt;/p&gt;

&lt;p&gt;A bounded trust judgment with evidence paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The follow-on lane should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provision the target dependency environment&lt;/li&gt;
&lt;li&gt;run selected target tests in a controlled shell&lt;/li&gt;
&lt;li&gt;capture command, exit code, environment hash, and output digest&lt;/li&gt;
&lt;li&gt;attach a replay manifest to &lt;code&gt;experiment_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep runtime evidence separate from source/document/CI evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.&lt;/p&gt;
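&lt;p&gt;As a sketch of what one replay-manifest entry could look like, here is a minimal capture helper. The function name and field names below are illustrative assumptions for this post, not the official v1.1.2 schema.&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical replay-manifest capture for the follow-on runtime lane.
# run_and_capture and its field names are illustrative assumptions,
# not part of the official STEM-AI v1.1.2 output schema.
import hashlib
import json
import os
import subprocess
import sys


def run_and_capture(cmd):
    """Run one target test command and record replayable evidence."""
    # Hash the environment so a reviewer can detect a changed setup.
    env_hash = hashlib.sha256(
        json.dumps(sorted(os.environ.items())).encode("utf-8")
    ).hexdigest()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": " ".join(cmd),
        "exit_code": proc.returncode,
        "environment_hash": env_hash,
        "output_digest": hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest(),
    }


# Kept under its own key so runtime evidence stays separate from
# source/document/CI evidence.
manifest = {"runtime_evidence": [run_and_capture([sys.executable, "--version"])]}
print(json.dumps(manifest, indent=2))
```

&lt;p&gt;The point is not the hashing details; it is that every runtime claim arrives with a command, an exit code, and digests a reviewer can replay.&lt;/p&gt;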




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" alt="Stem-AI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;STEM-AI is &lt;strong&gt;not a clinical certifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is also &lt;strong&gt;not trying to replace scientific review, regulatory review, or domain experts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Its role is narrower: &lt;strong&gt;make the governance conversation start from observable evidence instead of presentation quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, that means asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the repository claim?&lt;/li&gt;
&lt;li&gt;What does the code actually implement?&lt;/li&gt;
&lt;li&gt;Do the local surfaces agree with each other?&lt;/li&gt;
&lt;li&gt;Are the tests domain-specific or merely infrastructural?&lt;/li&gt;
&lt;li&gt;Are clinical-adjacent boundaries explicit?&lt;/li&gt;
&lt;li&gt;Can the auditor's own scoring logic be inspected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where I think STEM-AI belongs in AI governance.&lt;/p&gt;

&lt;p&gt;Not as the final authority.&lt;/p&gt;

&lt;p&gt;As the evidence gate before authority is invoked.&lt;/p&gt;

&lt;p&gt;It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does this repository establish enough observable trust to be considered, contained, or rejected?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>bioinformatics</category>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Each /slop Is a Calibration Signal — AI-SLOP Detector v3.6.0 and the Claude Code Skill</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:04:44 +0000</pubDate>
      <link>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</link>
      <guid>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" alt="The Quiet Failure of AI Development" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-assisted development has a quiet failure mode: the assistant that creates the pattern often becomes the assistant that reviews it.&lt;/p&gt;

&lt;p&gt;When you and Claude work inside the same session, you drift together. The review criteria shift with the assistant's habits. After enough sessions, the same assistant that wrote the hollow function body is also the one approving the pull request. There is no external reference point — unless you build one.&lt;/p&gt;

&lt;p&gt;That is the problem AI-SLOP Detector v3.6.0 addresses with the Claude Code skill.&lt;/p&gt;

&lt;p&gt;Every time you run &lt;code&gt;/slop&lt;/code&gt; inside a session, the scan result is recorded to a project-scoped history. When enough re-scan evidence accumulates, bounded self-calibration adjusts the detection weights for your codebase — automatically, without a manual command. The scanner does not drift with the session. It stays anchored to observed scan outcomes.&lt;/p&gt;

&lt;p&gt;It does not get smarter every time. It builds calibration signal every time. That is a more accurate claim, and the distinction matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Skill Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" alt="The Skill layer Quality Policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; claude-skills/slop-detector ~/.claude/skills/slop-detector
&lt;span class="c"&gt;# restart Claude Code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four slash commands become available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full project scan — interprets findings, prioritizes fixes, proposes patch plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-file [path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-file deep-dive — explains each metric, gives concrete fix per pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-gate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard gate decision — PASS or FAIL, lists blocking files with deficit_score &amp;gt;= 70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-spar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adversarial validation — probes metric boundaries, catches calibration drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
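&lt;p&gt;The &lt;code&gt;/slop-gate&lt;/code&gt; rule in the table reduces to a few lines. The JSON shape below (&lt;code&gt;files&lt;/code&gt;, &lt;code&gt;deficit_score&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;) is assumed for illustration; the real slop-detector output schema may differ.&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of the /slop-gate decision rule: FAIL if any file is at or
# above the blocking threshold. JSON field names are assumptions.
import json

GATE_THRESHOLD = 70


def gate(scan_json):
    report = json.loads(scan_json)
    blocking = [
        f["path"]
        for f in report.get("files", [])
        if f.get("deficit_score", 0) >= GATE_THRESHOLD
    ]
    return {"decision": "FAIL" if blocking else "PASS", "blocking": blocking}


sample = json.dumps({"files": [
    {"path": "ddc.py", "deficit_score": 11.0},
    {"path": "legacy.py", "deficit_score": 82.5},
]})
print(gate(sample))  # FAIL: legacy.py blocks the merge
```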

&lt;p&gt;The intended workflow inside a Claude session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. /slop               → baseline scan, identify top offenders
2. review findings     → Claude prioritizes by deficit_score
3. patch files         → fix patterns with Claude's help
4. /slop-file &amp;lt;path&amp;gt;   → verify improvement per file
5. /slop               → confirm project aggregate improved
6. /slop-gate          → gate decision before merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality policy lives in the skill layer. You do not re-explain what &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; means or which patterns are critical in every session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The LEDA Flywheel
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" alt="The LEDA Flywheel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that matters.&lt;/p&gt;

&lt;p&gt;LEDA is not model retraining. It is bounded weight calibration based on repeated scan outcomes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/slop&lt;/code&gt; runs &lt;code&gt;slop-detector --project . --json&lt;/code&gt; — without &lt;code&gt;--no-history&lt;/code&gt;. Every invocation auto-records results to &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;, tagged with a &lt;code&gt;project_id&lt;/code&gt; (sha256 of cwd) so signals never mix across different repositories.&lt;/p&gt;
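&lt;p&gt;For intuition, a project-scoped key like that can be derived in one line. The exact normalization (absolute path, UTF-8 encoding) is my assumption; the real tool may differ in detail.&lt;br&gt;
&lt;/p&gt;

```python
# Illustrative project_id derivation: sha256 of the working directory.
# Normalization details are assumptions, not the tool's exact code.
import hashlib
from pathlib import Path


def project_id(cwd):
    return hashlib.sha256(str(Path(cwd).resolve()).encode("utf-8")).hexdigest()


# Two different repositories always get different keys,
# so their calibration signals never mix in history.db.
print(project_id("."))
```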

&lt;p&gt;After every &lt;strong&gt;10 re-scanned files&lt;/strong&gt;, the tool runs the LEDA self-calibration loop automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/slop called
    │
    ├─► scan result → recorded to history.db (project-scoped)
    │
    ├─► 10 re-scanned files milestone?
    │       └─► SelfCalibrator: 4D grid-search over run history
    │               (ldr × inflation × ddc × purity weights)
    │               └─► confidence gap &amp;gt; 0.10?
    │                       └─► .slopconfig.yaml updated silently
    │
    └─► next /slop → calibrated weights, sharper detection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibrator uses re-scanned files as signal — not raw record count. A file counts toward the milestone only when the tool has seen it improve or degrade across at least two runs. This prevents first-time project scans from triggering calibration on noise.&lt;/p&gt;
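&lt;p&gt;That milestone rule is easy to state in code. The record shape and the milestone size of 10 are taken from the description above; everything else is an illustrative sketch, not the calibrator's actual implementation.&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of the "re-scanned files" milestone: a file contributes only
# once it has at least two recorded runs, so a first-time project scan
# never triggers calibration on noise. Record shape is an assumption.
from collections import Counter

MILESTONE = 10


def rescanned_files(records):
    runs_per_file = Counter(r["path"] for r in records)
    return sum(1 for n in runs_per_file.values() if n >= 2)


def should_calibrate(records):
    return rescanned_files(records) >= MILESTONE


history = [{"path": "a.py"}, {"path": "a.py"}, {"path": "b.py"}]
print(rescanned_files(history))  # 1: only a.py has been scanned twice
```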

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" alt="Constrained to Reality" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three constraints keep calibration bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain-anchored&lt;/strong&gt; — grid search is constrained to ±0.15 around domain baseline weights. Detection cannot drift outside the meaningful range for your project type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence gate&lt;/strong&gt; — only applies when the top candidate weight set beats the second by &amp;gt; 0.10. Ambiguous signals produce no change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift warnings&lt;/strong&gt; — &lt;code&gt;CalibrationResult.warnings&lt;/code&gt; flags any dimension that shifted &amp;gt; 0.25 from the anchor.&lt;/li&gt;
&lt;/ul&gt;
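&lt;p&gt;The three constraints above can be sketched as three small checks. The constants come from the list; the function names and weight keys are illustrative, not the real &lt;code&gt;SelfCalibrator&lt;/code&gt; API.&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of the bounding rules: anchor clamp (0.15 band), confidence
# gate (0.10 gap), drift warnings (0.25 shift). Names are illustrative.
ANCHOR_BAND = 0.15
CONFIDENCE_GAP = 0.10
DRIFT_WARN = 0.25


def clamp_to_anchor(candidate, anchor):
    # Grid-search winners cannot leave the domain-anchored band.
    return {
        k: min(max(candidate[k], anchor[k] - ANCHOR_BAND), anchor[k] + ANCHOR_BAND)
        for k in anchor
    }


def accept(best_score, runner_up_score):
    # Ambiguous signals produce no change.
    return (best_score - runner_up_score) > CONFIDENCE_GAP


def drift_warnings(applied, anchor):
    return [k for k in anchor if abs(applied[k] - anchor[k]) > DRIFT_WARN]


anchor = {"ldr": 0.30, "inflation": 0.25, "ddc": 0.25, "purity": 0.20}
candidate = {"ldr": 0.60, "inflation": 0.20, "ddc": 0.25, "purity": 0.20}
applied = clamp_to_anchor(candidate, anchor)
print(round(applied["ldr"], 2))  # 0.45: pulled back to the anchor band
```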

&lt;p&gt;&lt;code&gt;/slop-spar&lt;/code&gt; adds a separate adversarial layer: it probes known-pattern anchors, metric boundary cases, and existence conditions. When it detects that measured behavior has diverged from metric claims, it recommends &lt;code&gt;--self-calibrate --apply-calibration&lt;/code&gt; explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Data Shows — and What We Won't Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" alt="Workflow telemetry, not empty claims" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will not tell you that AI-SLOP Detector improves code quality by X%.&lt;/p&gt;

&lt;p&gt;We have not run a controlled study. We have not compared matched projects with and without the tool. Any number we put here would be a claim we cannot prove, and this tool is built specifically to catch that kind of thing.&lt;/p&gt;

&lt;p&gt;What we do have: the tool scanning itself. Every time a core module was changed, it got re-scanned. N = 14,367 records across all projects in &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not outcome evidence. It is workflow telemetry. Here is what the scan history shows for the eight most-improved files in this codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File                   Scans  Worst → Best   Improvement
─────────────────────────────────────────────────────────
ddc.py                   86   87.8 →  11.0    -76.8 pts
placeholder.py           92   70.3 →   0.0    -70.3 pts
cross_file.py            89   70.3 →   5.0    -65.3 pts
ci_gate.py               88   69.3 →   6.2    -63.1 pts
cli.py                   88   68.4 →   8.4    -60.0 pts
ldr.py                   90   58.0 →   0.1    -57.9 pts
python_advanced.py       95   74.0 →  18.0    -56.0 pts
context_jargon.py        86   55.7 →   5.0    -50.7 pts
─────────────────────────────────────────────────────────
Source: self-scan, history.db — not an independent study
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
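&lt;p&gt;The per-file worst/best aggregation in that table is easy to reproduce from any scan-history table. A minimal sketch against an in-memory stand-in, assuming a hypothetical &lt;code&gt;scans(file, score)&lt;/code&gt; schema; the real &lt;code&gt;history.db&lt;/code&gt; layout may differ:&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the scan history. Assumed (hypothetical) schema:
# one row per scan, recording the file path and its deficit score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (file TEXT, score REAL)")
conn.executemany("INSERT INTO scans VALUES (?, ?)", [
    ("ddc.py", 87.8), ("ddc.py", 40.2), ("ddc.py", 11.0),
    ("ldr.py", 58.0), ("ldr.py", 0.1),
])

# Per-file scan count, worst score, best score, and net improvement.
rows = conn.execute("""
    SELECT file,
           COUNT(*)                AS scans,
           MAX(score)              AS worst,
           MIN(score)              AS best,
           MIN(score) - MAX(score) AS improvement
    FROM scans
    GROUP BY file
    ORDER BY improvement ASC
""").fetchall()

for name, n, worst, best, delta in rows:
    print(f"{name:12s} {n:3d}  {worst:5.1f} -> {best:5.1f}  {delta:+6.1f} pts")
```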



&lt;p&gt;And the weekly project aggregate (avg deficit score):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week      Avg Deficit   Critical Files   Note
────────────────────────────────────────────────────────
2026-W09     11.9            3           baseline
2026-W10     22.1           20           structural refactor spike
2026-W14     20.0           58           large feature addition
2026-W15     11.9           14           post-refactor recovery
2026-W17     12.2           13           current — stable CLEAN state
────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mechanism is not mysterious. Scan reveals structural problems → Claude sees exact pattern names and line references → Claude (or the developer) fixes them → rescan confirms improvement → LEDA registers the delta and adjusts detection weights accordingly.&lt;/p&gt;
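&lt;p&gt;The loop itself fits in a few lines. Here &lt;code&gt;scan&lt;/code&gt;, &lt;code&gt;fix&lt;/code&gt;, and &lt;code&gt;register_delta&lt;/code&gt; are stand-ins for the real components, not the tool's API:&lt;/p&gt;

```python
def audit_loop(scan, fix, register_delta, path, target=10.0, max_rounds=5):
    """Scan -> fix -> rescan until the deficit score reaches the target.
    scan(path) returns (score, findings); fix(path, findings) applies
    repairs; register_delta(path, before, after) is the LEDA-style
    feedback step that lets detection weights follow observed outcomes."""
    score, findings = scan(path)
    for _ in range(max_rounds):
        if score <= target:
            break
        fix(path, findings)                     # repair the named patterns
        new_score, findings = scan(path)        # rescan to confirm
        register_delta(path, score, new_score)  # feed the delta back
        score = new_score
    return score
```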

&lt;p&gt;The loop does not guarantee quality. It makes quality visible, then measurable, then improvable.&lt;/p&gt;

&lt;p&gt;Whether that loop improves your codebase is something your &lt;code&gt;history.db&lt;/code&gt; will tell you — not us.&lt;/p&gt;




&lt;h2&gt;
  
  
  Also in v3.6.0
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" alt="System diagnostics &amp;amp; Protocol refinements" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI gate exit code fix.&lt;/strong&gt; &lt;code&gt;--ci-mode hard&lt;/code&gt; without &lt;code&gt;--ci-report&lt;/code&gt; was returning exit 0 even on &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; files; the fix was two lines in &lt;code&gt;_evaluate_ci_gate()&lt;/code&gt; (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0d67997" rel="noopener noreferrer"&gt;&lt;code&gt;0d67997&lt;/code&gt;&lt;/a&gt;). The bug affected v3.1.1 through v3.5.0, but only on the specific path where the gate was used without the reporting flag. A subprocess-level regression test was added to prevent recurrence (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0208af4" rel="noopener noreferrer"&gt;&lt;code&gt;0208af4&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
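&lt;p&gt;The class of bug is worth spelling out. A sketch of the fixed decision path, with illustrative names (see the linked commit for the actual change):&lt;/p&gt;

```python
CRITICAL = "CRITICAL_DEFICIT"

def write_report(results):
    """Hypothetical stand-in for the --ci-report output step."""

def evaluate_ci_gate(results, ci_mode, ci_report=False):
    """Fixed behavior: the gate decides the exit code whether or not a
    report was requested. The pre-fix bug computed the nonzero exit only
    on the reporting path, so hard mode without --ci-report returned 0
    even when a file was CRITICAL_DEFICIT."""
    failed = any(r["status"] == CRITICAL for r in results)
    if ci_report:
        write_report(results)
    return 1 if ci_mode == "hard" and failed else 0
```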

&lt;p&gt;&lt;strong&gt;Pre-commit hooks rewritten.&lt;/strong&gt; Three hook variants now use &lt;code&gt;python -m slop_detector.cli&lt;/code&gt; as the entry point (bypassing a Windows &lt;code&gt;.exe&lt;/code&gt; wrapper exit-code issue), and the nonexistent &lt;code&gt;--severity high&lt;/code&gt; flag has been replaced with &lt;code&gt;--ci-mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/flamehaven01/AI-SLOP-Detector&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.6.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-detector&lt;/span&gt;           &lt;span class="c1"&gt;# hard gate&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-warn    # report only&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-patterns  # fast per-file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;VS Code Extension v3.6.0.&lt;/strong&gt; Version tracks core library. No behavior changes from v3.5.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shape of the Loop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" alt="An External reference point" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill + LEDA loop is the external reference point. Detection weights stay grounded in observed scan outcomes — files that improved across re-scans, files that stayed problematic — rather than in what the assistant believes is correct at any given moment.&lt;/p&gt;


&lt;p&gt;We won't tell you what percentage your code will improve. That would make us the thing we are trying to detect.&lt;/p&gt;

&lt;p&gt;The scanner is not Claude's opinion about code quality. It is a measurement that gets calibrated against reality, session by session. Your &lt;code&gt;history.db&lt;/code&gt; will tell you the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" alt="The Shape of the Loop" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/ai-slop-detector/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/CLAUDE_CODE_SKILL.md" rel="noopener noreferrer"&gt;Claude Code Skill docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/SELF_CALIBRATION.md" rel="noopener noreferrer"&gt;Self-Calibration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>claudeai</category>
      <category>codequality</category>
      <category>ai</category>
    </item>
    <item>
      <title>When an AI Pipeline Passes — But One Path Still Must Be Held: EXP-034</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 27 Apr 2026 10:09:19 +0000</pubDate>
      <link>https://dev.to/flamehaven01/when-an-ai-pipeline-passes-but-one-path-still-must-be-held-exp-034-16af</link>
      <guid>https://dev.to/flamehaven01/when-an-ai-pipeline-passes-but-one-path-still-must-be-held-exp-034-16af</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fianvzlqvv7dw9ptnldvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fianvzlqvv7dw9ptnldvx.png" alt="Cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;No efficacy, causal, or clinical claims are made in this report.&lt;/em&gt;&lt;br&gt;
RExSyn is an experimental Bio-AI governance pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You do not need to know the earlier experiments to read this report.&lt;/p&gt;

&lt;p&gt;Most AI pipeline reports ask one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the system pass?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 asked a stricter one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which path was allowed to count?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a multi-stage AI pipeline, a final &lt;code&gt;PASS&lt;/code&gt; can hide a lot of unresolved risk. A branch may be unstable. A regeneration path may drift. A new external API may enter the chain without being governed. A new modality may appear to improve the system while quietly changing the basis of judgment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7hho1hqvwps2xp0m9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7hho1hqvwps2xp0m9p.png" alt="The real result" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So EXP-034 was not designed to produce a clean success story.&lt;/p&gt;

&lt;p&gt;It was designed to separate three things:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anchored expansion path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GO&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepted path for EXP-034 reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current regeneration path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HOLD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Diagnostic evidence, not acceptance baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Next remediation cycle&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EXP-035&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RCA and repair target&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the real result.&lt;/p&gt;

&lt;p&gt;EXP-034 passed, but not because every path passed.&lt;/p&gt;

&lt;p&gt;It passed because the accepted anchor remained stable, the expansion tracks did not break the judgment system, and the unresolved regeneration path was explicitly held instead of being silently mixed into acceptance.&lt;/p&gt;




&lt;h2&gt;
  
  
  What EXP-034 tested
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve4hb86wepyckegmtdfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve4hb86wepyckegmtdfz.png" alt="Locking the Boundary" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-033 had already established a parity baseline.&lt;/p&gt;

&lt;p&gt;EXP-034 asked whether that baseline could survive controlled expansion while adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a modal update track,&lt;/li&gt;
&lt;li&gt;a live AlphaFold EBI observer endpoint,&lt;/li&gt;
&lt;li&gt;and AlphaGenome / AG measurement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operating rule was simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reproduce the parity baseline first.&lt;/li&gt;
&lt;li&gt;Only then allow expansion.&lt;/li&gt;
&lt;li&gt;Only then compare governance behavior across experiment cycles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the parity anchor breaks, the rest is not expansion.&lt;/p&gt;

&lt;p&gt;It is regression.&lt;/p&gt;

&lt;p&gt;The scope was also locked: methodology, governance, and reproducibility only. The experiment did not claim biological efficacy, causal inference, or clinical recommendation.&lt;/p&gt;

&lt;p&gt;That boundary is important because this kind of system can easily sound more powerful than what was actually measured. EXP-034 was not asking whether the pipeline discovered a better biological answer.&lt;/p&gt;

&lt;p&gt;It was asking whether the judgment system stayed governable after new signals entered the chain.&lt;/p&gt;




&lt;h2&gt;
  
  
  The key split: PASS did not mean everything passed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsjcpcij1hext56s3s1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsjcpcij1hext56s3s1b.png" alt="The key split" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track-A produced the defining decision of the experiment.&lt;/p&gt;

&lt;p&gt;The accepted legacy replay anchor preserved the required PASS/BLOCK separation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Legacy replay anchor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sample balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arm accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arm balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dangerous false-pass rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;false reject rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That was the path allowed to anchor EXP-034.&lt;/p&gt;

&lt;p&gt;But the current regeneration path did not recover:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Current regeneration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sample balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HOLD&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the most important part of the experiment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXP-034 did not pretend the regeneration path passed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It kept that result inside the experiment as diagnostic evidence, but did not allow it to redefine the accepted baseline.&lt;/p&gt;

&lt;p&gt;That separation is not a minor operational detail. It is the governance result.&lt;/p&gt;

&lt;p&gt;A weak pipeline would have blended the two paths and still reported a final success. EXP-034 did the opposite. It allowed the stable anchor to proceed and held the unstable path for RCA.&lt;/p&gt;

&lt;p&gt;That is how a stage-gated system avoids changing its own question after seeing the result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why path splitting matters
&lt;/h2&gt;

&lt;p&gt;The concrete governance problem is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A pipeline can pass for the wrong reason.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_report = stable_anchor × traceable_extension × contained_instability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
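&lt;p&gt;Read as boolean gates, the formula is a strict conjunction: any single failing condition invalidates the report. A minimal sketch of that rule:&lt;/p&gt;

```python
def evaluate_report(stable_anchor, traceable_extension, contained_instability):
    """The conjunction above as a checklist: the report is valid only if
    every condition holds; otherwise the failed conditions are named."""
    checks = {
        "stable_anchor": stable_anchor,
        "traceable_extension": traceable_extension,
        "contained_instability": contained_instability,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed
```

&lt;p&gt;Returning the failed condition names, rather than a bare boolean, is what lets a report say which leg broke.&lt;/p&gt;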



&lt;p&gt;If the anchor is not stable, the report cannot be trusted.&lt;/p&gt;

&lt;p&gt;If the extension is not traceable, the new signal becomes an ungoverned side channel.&lt;/p&gt;

&lt;p&gt;If instability is not contained, a diagnostic failure can quietly contaminate acceptance.&lt;/p&gt;

&lt;p&gt;A single final &lt;code&gt;PASS&lt;/code&gt; is not enough when several branches contribute to a verdict. You need to know which branch produced the accepted decision, which branch failed, which branch was only diagnostic, and which branch is allowed to affect future work.&lt;/p&gt;

&lt;p&gt;EXP-034 passed because all three conditions were enforced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the legacy replay anchor held,&lt;/li&gt;
&lt;li&gt;the new observer and AG paths were measured under governance,&lt;/li&gt;
&lt;li&gt;and the regeneration HOLD remained outside acceptance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference between a pipeline that merely outputs a verdict and a pipeline that controls which verdicts are allowed to count.&lt;/p&gt;




&lt;h2&gt;
  
  
  Adding AlphaFold EBI as an observer, not a predictor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F703iwj95lg12xjdtdgw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F703iwj95lg12xjdtdgw9.png" alt="Controlled Expansion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Relative to EXP-033, EXP-034 added a live AlphaFold Protein Structure Database / EBI observer line.&lt;/p&gt;

&lt;p&gt;This was not promoted into a primary predictor.&lt;/p&gt;

&lt;p&gt;It was wired as an observer/reference oracle and traced into governance as &lt;code&gt;ebi_g2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AlphaFold EBI direct endpoint for &lt;code&gt;P23219&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GO&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 7 observer tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2 passed&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;ebi_g2&lt;/code&gt; governance traceability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BLOCKED_IDP&lt;/code&gt; mapping path&lt;/td&gt;
&lt;td&gt;validated in test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point is not simply that an external endpoint responded.&lt;/p&gt;

&lt;p&gt;The point is that the external signal entered the system through a governed path. It was not allowed to float beside the pipeline as informal context.&lt;/p&gt;

&lt;p&gt;EXP-034 tested whether the new observer could be admitted without becoming an ungoverned side channel.&lt;/p&gt;




&lt;h2&gt;
  
  
  AG-live: non-degradation, not repair
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadd3dcaw37lha9ua1zch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadd3dcaw37lha9ua1zch.png" alt="AG-live: non-degradation, not repair" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track-C tested a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If AG-live enters the pipeline, does it change the final decision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer was no.&lt;/p&gt;

&lt;p&gt;AG-live did enter the pipeline.&lt;/p&gt;

&lt;p&gt;The AlphaGenome field was present with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AG field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;source&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alphagenome_api_live&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pathogenicity_score&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;confidence&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7143&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clinical_significance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uncertain&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are sanitized branch artifact values, not implementation code or full raw artifacts.&lt;/p&gt;

&lt;p&gt;AG-live did not change classification.&lt;/p&gt;

&lt;p&gt;Both controls remained governed by the same conservative decision boundary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Observed&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;EXP032-BLOCK-001&lt;/code&gt; negative control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK_EXPECTED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK / ESCALATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail-closed behavior preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;EXP032-PASS-001&lt;/code&gt; pass-eligible control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS_ELIGIBLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK / ESCALATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;conservative over-blocking persisted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the key nuance.&lt;/p&gt;

&lt;p&gt;AG-live did not create a dangerous false-pass. The negative control stayed blocked.&lt;/p&gt;

&lt;p&gt;But AG-live also did not repair the current regeneration hold. The pass-eligible control still failed to recover and remained blocked under &lt;code&gt;R2_component_floor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The governance surface moved slightly, but the verdict did not:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Earlier AG branch&lt;/th&gt;
&lt;th&gt;AG-live branch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_e2e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0912&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0947&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clinical status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rule&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R2_component_floor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R2_component_floor&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So the correct conclusion is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AG improved the pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The correct conclusion is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AG-live changed the measurement surface slightly, but did not change the decision boundary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is exactly what non-degradation means here.&lt;/p&gt;

&lt;p&gt;It preserved fail-closed behavior on the negative control while leaving the pass-eligible control over-blocked.&lt;/p&gt;

&lt;p&gt;This is why Track-C can only be called non-degradation, not repair.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contract passed, but governance still blocked
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclbqwjvytk3ql600icp1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclbqwjvytk3ql600icp1.png" alt="Contract passed, but governance still blocked" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most useful details in EXP-034 is that the contract layer and governance layer did not collapse into one verdict.&lt;/p&gt;

&lt;p&gt;The contract inspection reported:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pipeline contract score&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.9077&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weakest connection&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dangerous pass risk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gate recommendation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;overall OK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the clinical governance layer still blocked the case.&lt;/p&gt;

&lt;p&gt;That is not a contradiction.&lt;/p&gt;

&lt;p&gt;It means the pipeline connection was valid enough to inspect, but the decision was not safe enough to accept.&lt;/p&gt;

&lt;p&gt;This distinction matters.&lt;/p&gt;

&lt;p&gt;A weaker system might treat a passing contract as permission to pass the whole output. EXP-034 did not do that. It allowed the contract layer to say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pipeline is connected.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;while the governance layer could still say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The claim should not pass.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That separation is exactly what a governance layer is supposed to preserve.&lt;/p&gt;
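&lt;p&gt;That two-layer separation can be stated directly in code. A minimal sketch, with illustrative names:&lt;/p&gt;

```python
def combined_decision(contract_ok, governance_verdict):
    """Keep the two layers distinct: the contract layer answers
    'is the pipeline connected?', the governance layer answers
    'may this claim pass?'. A passing contract is never, by itself,
    permission to pass the output."""
    return {
        "contract": "PASS" if contract_ok else "FAIL",
        "governance": governance_verdict,
        "output_allowed": contract_ok and governance_verdict == "PASS",
    }
```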




&lt;h2&gt;
  
  
  Cross-cycle comparison: EXP-032 → EXP-033 → EXP-034
&lt;/h2&gt;

&lt;p&gt;Track-D compared the accepted anchor path across cycles.&lt;/p&gt;

&lt;p&gt;You do not need the earlier experiments as background. They matter here for one reason only:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXP-034 was not allowed to invent a new success criterion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EXP-032 and EXP-033 provided the previous PASS/BLOCK baseline. EXP-034 tested whether that baseline survived expansion.&lt;/p&gt;

&lt;p&gt;The classification baseline stayed fixed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compare&lt;/th&gt;
&lt;th&gt;Accuracy / balanced accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EXP-032 → EXP-034&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0 / 1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXP-033 → EXP-034&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0 / 1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the same time, governance signals moved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Governance signal&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ccge_p_e2e_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+0.0445&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nnsl_sr9_tech_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+0.0469&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nnsl_di2_tech_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-0.0367&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interpretation is narrow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judgment baseline stayed fixed while the governance surface became more measurable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is what EXP-034 was allowed to claim.&lt;/p&gt;

&lt;p&gt;It did not prove biological efficacy.&lt;/p&gt;

&lt;p&gt;It did not prove that every branch of the system was now stable.&lt;/p&gt;

&lt;p&gt;It proved that controlled expansion could happen without breaking the accepted PASS/BLOCK baseline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage-gate result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0jedqsq4ivhh5du6an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0jedqsq4ivhh5du6an.png" alt="Cross-cycle comparison" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 ended with all five stage gates passing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;G1 parity&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G2 reproducibility&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G3 cross-experiment compare&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G4 governance traceability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G5 extension safety&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
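&lt;p&gt;The &lt;code&gt;first failed gate: null&lt;/code&gt; field in the final state follows from evaluating gates in order and stopping at the first failure. A minimal sketch of that sequencing (gate names from the table above; the checks themselves are stubs):&lt;/p&gt;

```python
def run_stage_gates(gates):
    """gates: ordered (name, check) pairs. Evaluate in order and stop at
    the first failure; when every gate passes, first_failed_gate is None
    (serialized as null)."""
    for name, check in gates:
        if not check():
            return "FAIL", name
    return "PASS", None

# EXP-034's five gates, with every check stubbed to pass:
gates = [(name, lambda: True) for name in [
    "G1 parity", "G2 reproducibility", "G3 cross-experiment compare",
    "G4 governance traceability", "G5 extension safety",
]]
status, first_failed = run_stage_gates(gates)
```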

&lt;p&gt;Final state:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;overall status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;anchor mode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;legacy_replay&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;first failed gate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;null&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;diagnostic hold&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Track-A current regeneration&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the important nuance:&lt;/p&gt;

&lt;p&gt;The experiment passed with a retained diagnostic hold.&lt;/p&gt;

&lt;p&gt;That is not a contradiction. It is the point of the control system.&lt;/p&gt;

&lt;p&gt;The accepted anchor path was allowed to proceed. The current regeneration path was not. The remediation target was moved to EXP-035.&lt;/p&gt;

&lt;p&gt;That separation is the actual proof EXP-034 provides: not that every branch became stable, but that instability was not allowed to contaminate acceptance.&lt;/p&gt;
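&lt;p&gt;The separation can be sketched in a few lines of logic: gates decide acceptance, holds are carried alongside it without flipping the verdict. This is a minimal illustration of the pattern, not the pipeline's actual code; every name below is hypothetical.&lt;/p&gt;

```python
# Minimal sketch of a stage-gate verdict that keeps diagnostic holds
# separate from acceptance. All names here are illustrative.

def evaluate_gates(gates, holds):
    """Gates decide PASS/BLOCK; holds are tracked but never
    mixed into the accepted result."""
    first_failed = next((name for name, ok in gates if not ok), None)
    return {
        "overall": "PASS" if first_failed is None else "BLOCK",
        "first_failed_gate": first_failed,   # null-equivalent when all gates pass
        "diagnostic_holds": list(holds),     # retained, not folded into acceptance
    }

verdict = evaluate_gates(
    gates=[("G1", True), ("G2", True), ("G3", True), ("G4", True), ("G5", True)],
    holds=["Track-A current regeneration"],
)
# A verdict can be PASS while a diagnostic hold remains open.
```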




&lt;h2&gt;
  
  
  What EXP-034 actually showed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69c63nfb2kt1gso8awp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69c63nfb2kt1gso8awp2.png" alt="What EXP-034 actually showed" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 did not show that the entire pipeline is now stable.&lt;/p&gt;

&lt;p&gt;It showed something narrower and more useful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A method-locked Bio-AI governance pipeline can admit modal expansion, AlphaFold EBI observer wiring, and AG-live measurement without losing its accepted PASS/BLOCK baseline — while keeping the unstable regeneration path out of acceptance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track-C sharpened that conclusion.&lt;/p&gt;

&lt;p&gt;AG-live entered.&lt;br&gt;&lt;br&gt;
Metrics moved slightly.&lt;br&gt;&lt;br&gt;
The verdict did not change.&lt;br&gt;&lt;br&gt;
Dangerous false-pass did not appear.&lt;br&gt;&lt;br&gt;
Conservative over-blocking remained.&lt;/p&gt;

&lt;p&gt;That is not a clean success story.&lt;/p&gt;

&lt;p&gt;It is a governed result.&lt;/p&gt;


&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rj4k1d45y6vyolfgksu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rj4k1d45y6vyolfgksu.png" alt="The Mark of a Mature AI Pipeline" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage-gated experimentation is not just about getting a result.&lt;/p&gt;

&lt;p&gt;It is about deciding whether the result should be allowed to exist.&lt;/p&gt;

&lt;p&gt;In EXP-034, the answer was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GO   for the anchored expansion path
HOLD for current regeneration
NEXT for EXP-035 remediation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That may sound less dramatic than a clean success story.&lt;/p&gt;

&lt;p&gt;But in governance work, that is exactly the point.&lt;/p&gt;

&lt;p&gt;A mature AI pipeline is not the one that claims everything passed.&lt;/p&gt;

&lt;p&gt;It is the one that can say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This path passed.&lt;br&gt;&lt;br&gt;
This path did not.&lt;br&gt;&lt;br&gt;
And we did not mix them.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>reproducibility</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>FLAMEHAVEN FileSearch: Why This RAG Engine Feels Different from the Usual Stack</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:27:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/flamehaven-filesearch-why-this-rag-engine-feels-different-from-the-usual-stack-e83</link>
      <guid>https://dev.to/flamehaven01/flamehaven-filesearch-why-this-rag-engine-feels-different-from-the-usual-stack-e83</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa8waxadewljs6a47aqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa8waxadewljs6a47aqw.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FLAMEHAVEN FileSearch: Why This RAG Engine Feels Different from the Usual Stack
&lt;/h2&gt;

&lt;p&gt;RAG is no longer an exotic idea.&lt;/p&gt;

&lt;p&gt;At this point, most developers have seen the familiar stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parser&lt;/li&gt;
&lt;li&gt;chunker&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;vector store&lt;/li&gt;
&lt;li&gt;LLM&lt;/li&gt;
&lt;li&gt;framework wrapper&lt;/li&gt;
&lt;li&gt;demo query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not the interesting part anymore.&lt;/p&gt;

&lt;p&gt;The interesting part is what happens after the diagram:&lt;br&gt;
how much infrastructure the stack quietly demands, how much of the retrieval path is actually auditable, how much of the system is still mechanical rather than opaque, and how much operational tax the user is forced to absorb just to get a search engine running.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;FLAMEHAVEN FileSearch&lt;/strong&gt; gets more interesting than the usual "another RAG repo" framing.&lt;/p&gt;

&lt;p&gt;This is not a feature announcement. It is a technical look at what the project is actually doing differently.&lt;/p&gt;


&lt;h2&gt;
  
  
  The real problem with many RAG stacks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo70b5r5bwbjlsh76s0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo70b5r5bwbjlsh76s0.png" alt="Most RAG systems are assembly instructions" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of RAG systems are not products. They are assembly instructions.&lt;/p&gt;

&lt;p&gt;They give you flexibility, but they also leave you responsible for stitching together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file parsing&lt;/li&gt;
&lt;li&gt;chunking strategy&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;lexical retrieval&lt;/li&gt;
&lt;li&gt;semantic retrieval&lt;/li&gt;
&lt;li&gt;answer generation&lt;/li&gt;
&lt;li&gt;attribution&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;auth&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;caching&lt;/li&gt;
&lt;li&gt;deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is fine if you want a blank canvas.&lt;/p&gt;

&lt;p&gt;It is less fine if what you actually want is a document search engine that can be deployed without turning the setup itself into a second project.&lt;/p&gt;

&lt;p&gt;That is the first reason this repo feels different: it is trying to compress more of that surface area into one codebase.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is technically different here
&lt;/h2&gt;
&lt;h2&gt;
  
  
  1) Hybrid retrieval is treated as the baseline, not the upgrade path
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnojxzw992pkl5t22sme2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnojxzw992pkl5t22sme2.png" alt="compressing the stack" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of RAG repos still behave as if semantic retrieval is the main event and lexical matching is an optional add-on.&lt;/p&gt;

&lt;p&gt;That is backwards for real document systems.&lt;/p&gt;

&lt;p&gt;FLAMEHAVEN FileSearch builds around three explicit modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keyword&lt;/li&gt;
&lt;li&gt;semantic&lt;/li&gt;
&lt;li&gt;hybrid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part is the hybrid path itself.&lt;/p&gt;

&lt;p&gt;The retrieval stack combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;Korean + English tokenizer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;lazy per-store BM25 rebuild path&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters more than it sounds. The BM25 index is not eagerly rebuilt on every upload. It is marked dirty (&lt;code&gt;_bm25_dirty&lt;/code&gt;) and rebuilt on first hybrid search after mutation. That is a very practical decision. It keeps ingestion cheaper without pretending indexing is free.&lt;/p&gt;
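&lt;p&gt;The dirty-flag pattern is simple enough to sketch. This is an illustrative reconstruction, not the repo's actual class; only the &lt;code&gt;_bm25_dirty&lt;/code&gt; flag name comes from the description above.&lt;/p&gt;

```python
# Sketch of lazy index rebuilding: mutations only mark the index dirty;
# the rebuild cost is paid once, on the first search after a change.
# Class and method names are illustrative, not the repo's API.

class LazyBM25Store:
    def __init__(self):
        self.docs = []
        self._bm25_dirty = True
        self.rebuild_count = 0

    def add(self, doc):
        self.docs.append(doc)
        self._bm25_dirty = True      # cheap: no index work at ingest time

    def search(self, query):
        if self._bm25_dirty:
            self._rebuild()          # amortized: one rebuild per burst of uploads
        return [d for d in self.docs if query in d]

    def _rebuild(self):
        self.rebuild_count += 1      # stand-in for building the real BM25 index
        self._bm25_dirty = False

store = LazyBM25Store()
store.add("alpha fold report")
store.add("bm25 ranking notes")
store.search("bm25")                 # triggers exactly one rebuild for both uploads
```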

&lt;p&gt;This is one of the deeper differences from many vector-first RAG demos: the system does not assume semantic retrieval should dominate exact-match behavior. It assumes production search needs both.&lt;/p&gt;
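&lt;p&gt;The fusion step itself is standard. Here is a minimal Reciprocal Rank Fusion implementation, assuming the conventional &lt;code&gt;k=60&lt;/code&gt; constant (the repo's exact parameters are not shown in this post):&lt;/p&gt;

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion:
    each list contributes 1 / (k + rank) per document, so agreement
    across BM25 and semantic rankings is rewarded without ever
    comparing their raw scores directly."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]        # lexical ranking
semantic_hits = ["doc1", "doc9", "doc3"]    # vector ranking
fused = rrf_fuse([bm25_hits, semantic_hits])
```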


&lt;h2&gt;
  
  
  2) The indexing model is not just "document in, chunks out"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qk2weat5bt8415ovnzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qk2weat5bt8415ovnzc.png" alt="The KnowledgeAtom hierarchy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second meaningful difference is the indexing granularity.&lt;/p&gt;

&lt;p&gt;This repo introduces a &lt;strong&gt;KnowledgeAtom&lt;/strong&gt; layer: a two-level indexing model with&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file-level documents&lt;/li&gt;
&lt;li&gt;chunk-level atoms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those chunk atoms are not anonymous fragments. They carry stable fragment URIs of the form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local://store/encoded_path#c0001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That design solves two very common problems at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;precision retrieval&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;stable attribution&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file-level object remains available, but the system can also retrieve chunk-level units directly. That reduces the usual gap between "the document matched" and "the relevant passage was actually isolated."&lt;/p&gt;

&lt;p&gt;The URI choice matters too. A lot of local-first search code still uses basename-style references that collide the moment two files share a name. This repo moves to a reversible, quoted absolute-path-based URI namespace (&lt;code&gt;urllib.parse.quote(abs_path, safe='')&lt;/code&gt;), which is much less fragile.&lt;/p&gt;
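&lt;p&gt;A minimal sketch of such a reversible URI scheme, assuming hypothetical helper names (only the quoting call matches the one cited above):&lt;/p&gt;

```python
# Sketch of a reversible fragment-URI scheme. The helper names are
# illustrative; the quote(abs_path, safe='') call is the one the post cites.
from urllib.parse import quote, unquote

def make_fragment_uri(store, abs_path, chunk_index):
    encoded = quote(abs_path, safe="")   # reversible, collision-free path encoding
    return f"local://{store}/{encoded}#c{chunk_index:04d}"

def parse_fragment_uri(uri):
    prefix, _, rest = uri.partition("local://")
    store, _, tail = rest.partition("/")
    encoded, _, frag = tail.partition("#")
    return store, unquote(encoded), int(frag.lstrip("c"))

uri = make_fragment_uri("store", "/data/reports/q3.pdf", 1)
# Two files named q3.pdf in different directories now get distinct URIs,
# and the original path is recoverable from the URI alone.
```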

&lt;p&gt;That is not marketing polish. That is retrieval hygiene.&lt;/p&gt;




&lt;h2&gt;
  
  
  3) The chunking path is internal, structured, and mechanical
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F383b95htzfjgqjf66bt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F383b95htzfjgqjf66bt7.png" alt="Internal two-pass chunking" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another place where this codebase differs is that it does not outsource the core text pipeline by default.&lt;/p&gt;

&lt;p&gt;Instead of treating chunking as a thin wrapper around an external library, it implements an internal text chunker with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heading-boundary splitting&lt;/li&gt;
&lt;li&gt;paragraph splitting&lt;/li&gt;
&lt;li&gt;sentence fallback for oversized blocks&lt;/li&gt;
&lt;li&gt;undersized chunk merging (default minimum: 64 tokens)&lt;/li&gt;
&lt;li&gt;token-aware chunk sizing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The chunking system is actually two-pass under the hood. The structure-aware &lt;code&gt;TextChunker&lt;/code&gt; handles the document splits above. On top of that, &lt;code&gt;KnowledgeAtom&lt;/code&gt; applies a second windowing pass when generating chunk embeddings — 800-character windows, 120-character overlap, and an 80-character minimum before a fragment is dropped. These two paths are separate by design: &lt;code&gt;TextChunker&lt;/code&gt; is responsible for semantic structure, &lt;code&gt;KnowledgeAtom&lt;/code&gt; for granular embedding units.&lt;/p&gt;
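&lt;p&gt;The second windowing pass can be sketched directly from those constants. The function name is hypothetical; the 800/120/80 values are the ones stated above:&lt;/p&gt;

```python
# Sketch of the char-window embedding pass: fixed windows with overlap,
# dropping fragments below a minimum size. Constants match the post;
# the function name is illustrative.

def window_chunks(text, size=800, overlap=120, min_len=80):
    step = size - overlap
    windows = []
    for start in range(0, max(len(text), 1), step):
        fragment = text[start:start + size]
        if len(fragment) >= min_len:   # drop undersized tail fragments
            windows.append(fragment)
    return windows
```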

&lt;p&gt;The engine also ships a &lt;code&gt;ContextExtractor&lt;/code&gt; — a sliding-window utility that can enrich each chunk with text from its neighboring chunks before retrieval. It is fully tested, but it is not yet wired into the default ingestion path. It is available for downstream pipeline extension.&lt;/p&gt;

&lt;p&gt;So the pipeline architecture is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document
  → structure-aware split (TextChunker)
  → chunk atom embedding (KnowledgeAtom, 800-char windows)
  → multi-level indexing
  → retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is a better-shaped pipeline for document search than a naive chunk list.&lt;/p&gt;




&lt;h2&gt;
  
  
  4) The vector path is trying to remove operational weight, not add it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz03a8czoace2fhqm5t6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz03a8czoace2fhqm5t6o.png" alt="zero-dependency vectorization" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is probably the most unusual architectural choice in the repo.&lt;/p&gt;

&lt;p&gt;Instead of anchoring everything around a heavyweight embedding model stack, the project uses &lt;strong&gt;Gravitas Vectorizer v2.0&lt;/strong&gt;, a deterministic vectorization path built on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hybrid feature extraction (word tokens + character n-grams)&lt;/li&gt;
&lt;li&gt;signed feature hashing for collision mitigation&lt;/li&gt;
&lt;li&gt;SHA-256 based deterministic output&lt;/li&gt;
&lt;li&gt;no torch, no transformers, no model download&lt;/li&gt;
&lt;/ul&gt;
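&lt;p&gt;To make the idea concrete, here is a minimal hashed-vectorization sketch in the same spirit: word tokens plus character n-grams, signed hashing derived from a SHA-256 digest, pure Python. This illustrates the technique, not the Gravitas implementation itself:&lt;/p&gt;

```python
# Deterministic hashed vectorization sketch: no model, no download,
# same input always produces the same unit-length vector.
import hashlib
import math

def features(text, ngram=3):
    words = text.lower().split()
    grams = [text[i:i + ngram] for i in range(len(text) - ngram + 1)]
    return words + grams

def hashed_vector(text, dim=256):
    vec = [0.0] * dim
    for feat in features(text):
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0   # signed hashing mitigates collisions
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```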

&lt;p&gt;The trade-off is obvious: this is not trying to win a leaderboard as a giant foundation-model embedding backend.&lt;/p&gt;

&lt;p&gt;That is not the point.&lt;/p&gt;

&lt;p&gt;The point is that it makes the semantic path much cheaper to deploy, easier to reason about, and viable in environments where "just load another model" is operationally the wrong answer.&lt;/p&gt;

&lt;p&gt;Technically, that shows up in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic vector generation&lt;/li&gt;
&lt;li&gt;cold start under 1ms&lt;/li&gt;
&lt;li&gt;no ML framework dependency in the core vector path&lt;/li&gt;
&lt;li&gt;optional NumPy acceleration with pure-Python fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the semantic layer is being treated as infrastructure, not as a permanent excuse to expand infrastructure.&lt;/p&gt;

&lt;p&gt;That is rare.&lt;/p&gt;




&lt;h2&gt;
  
  
  5) The repo is explicit about local-first and multi-provider execution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a0kn73rqdv8bv1fja40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a0kn73rqdv8bv1fja40.png" alt="Architecture and provider abstraction" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of document search systems quietly assume one provider path.&lt;/p&gt;

&lt;p&gt;This repo does not.&lt;/p&gt;

&lt;p&gt;The provider layer supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;Anthropic&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;OpenAI-compatible endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters for two reasons.&lt;/p&gt;

&lt;p&gt;First, it keeps the system from being hardwired to one hosted model assumption.&lt;/p&gt;

&lt;p&gt;Second, it means the retrieval stack and the answer stack are not collapsed into the same dependency decision.&lt;/p&gt;

&lt;p&gt;That is an important architectural separation.&lt;/p&gt;

&lt;p&gt;For non-Gemini providers, the code takes a provider-RAG route: local semantic retrieval first, then prompt construction, then model answer generation. That is a much more honest design than pretending all providers support the same retrieval semantics natively.&lt;/p&gt;
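&lt;p&gt;That route reduces to a small contract: retrieval and attribution stay local, and only prompt-to-text generation is delegated to the provider. A hypothetical sketch (none of these names are the repo's API):&lt;/p&gt;

```python
# Provider-RAG route sketch: retrieve locally, build the prompt,
# hand only generation to whichever provider is configured.

def answer(query, retrieve, generate, top_k=3):
    chunks = retrieve(query)[:top_k]              # local semantic/lexical retrieval
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return {
        "answer": generate(prompt),               # provider-specific call
        "sources": [c["uri"] for c in chunks],    # attribution survives a provider swap
    }

# Any provider that maps prompt to text plugs in:
result = answer(
    "what is RRF?",
    retrieve=lambda q: [{"text": "RRF fuses ranked lists.", "uri": "local://s/doc#c0001"}],
    generate=lambda prompt: "RRF fuses ranked lists.",
)
```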

&lt;p&gt;The local Ollama path is especially relevant. Not because "local" is fashionable, but because self-hosted document search is often most attractive precisely when data boundary control matters more than marginal model quality gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  6) The codebase has been refactored toward narrower responsibilities
&lt;/h2&gt;

&lt;p&gt;One of the easiest ways to tell whether a repo is becoming more operationally serious is to look at whether the core orchestrator is shrinking or swelling.&lt;/p&gt;

&lt;p&gt;Here, the architecture moved in the right direction.&lt;/p&gt;

&lt;p&gt;The central &lt;code&gt;core.py&lt;/code&gt; was split into focused mixins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;IngestMixin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LocalSearchMixin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CloudSearchMixin&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not just aesthetic cleanup.&lt;/p&gt;

&lt;p&gt;It clarifies the system boundary between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion&lt;/li&gt;
&lt;li&gt;local retrieval/orchestration&lt;/li&gt;
&lt;li&gt;provider-backed answer generation&lt;/li&gt;
&lt;/ul&gt;
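&lt;p&gt;In miniature, the composition looks like this (the class names follow the post; the method bodies are purely illustrative):&lt;/p&gt;

```python
# Mixin-composition sketch: each concern lives in its own class,
# and the engine composes them. Method bodies are stand-ins.

class IngestMixin:
    def ingest(self, doc):
        self.docs.append(doc)

class LocalSearchMixin:
    def local_search(self, query):
        return [d for d in self.docs if query in d]

class CloudSearchMixin:
    def cloud_search(self, query, generate):
        context = self.local_search(query)        # reuses the local boundary
        return generate(f"Context: {context} Q: {query}")

class FileSearchEngine(IngestMixin, LocalSearchMixin, CloudSearchMixin):
    def __init__(self):
        self.docs = []
```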

&lt;p&gt;The same pattern appears elsewhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BackendRegistry&lt;/code&gt; maps file extensions to parser classes via &lt;code&gt;register()&lt;/code&gt; — new formats plug in without modifying existing dispatch logic&lt;/li&gt;
&lt;li&gt;duplicate helper blocks were pulled out of cloud search paths&lt;/li&gt;
&lt;li&gt;file parsing was reduced to dispatch instead of a single giant extractor module&lt;/li&gt;
&lt;/ul&gt;
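&lt;p&gt;The registry pattern is worth a small sketch. Only the class name and &lt;code&gt;register()&lt;/code&gt; come from the post; the parser and dispatch details are illustrative:&lt;/p&gt;

```python
# Extension-to-parser registry sketch: adding a format is one register()
# call, with no edits to existing dispatch logic.

class BackendRegistry:
    def __init__(self):
        self._parsers = {}

    def register(self, extension, parser_cls):
        self._parsers[extension] = parser_cls

    def parse(self, filename, data):
        ext = filename.rsplit(".", 1)[-1].lower()
        parser_cls = self._parsers.get(ext)
        if parser_cls is None:
            raise ValueError(f"no parser registered for .{ext}")
        return parser_cls().parse(data)

class TextParser:
    def parse(self, data):
        return data.decode("utf-8")

registry = BackendRegistry()
registry.register("txt", TextParser)   # a new format is one register() call
```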

&lt;p&gt;These changes do not make a flashy screenshot.&lt;/p&gt;

&lt;p&gt;They do make the code easier to maintain without quietly reintroducing the same complexity elsewhere.&lt;/p&gt;

&lt;p&gt;That is a real engineering improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark snapshot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh872w7a8wjwp9zw8p060.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh872w7a8wjwp9zw8p060.png" alt="Benchmark snapshot" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;System profile&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitas Vectorizer v2.0 (deterministic DSP, zero ML deps)&lt;/li&gt;
&lt;li&gt;ChronosGrid vector backend with quantized storage (int8)&lt;/li&gt;
&lt;li&gt;BM25 + RRF hybrid retrieval&lt;/li&gt;
&lt;li&gt;Local / pgvector backends&lt;/li&gt;
&lt;li&gt;Redis cache optional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Documented performance figures&lt;/strong&gt; (Docker, Apple M1, 500 PDFs ~2GB)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector generation: &lt;code&gt;&amp;lt;1ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search, cache hit: &lt;code&gt;9ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search, cache miss (includes Gemini API round-trip): &lt;code&gt;1,250ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Batch search (10 queries, parallel): &lt;code&gt;2,500ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Upload, 50MB file with indexing: &lt;code&gt;3,200ms&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What matters more than the numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cache-hit figure reflects the full path when semantic and lexical retrieval are served from warm indexes.&lt;/p&gt;

&lt;p&gt;The cache-miss figure is dominated by the Gemini API round-trip, not local retrieval.&lt;/p&gt;

&lt;p&gt;The performance story here is not just raw speed. It is that the repo achieves low-latency local retrieval by reducing dependency weight and simplifying the vector path, rather than by hiding heavy infrastructure behind abstraction.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A comparison that is actually worth making
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjehh80hs06t3tvd3ffz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjehh80hs06t3tvd3ffz.png" alt="A comparison that is actually worth making" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The wrong comparison is:&lt;/p&gt;

&lt;p&gt;"Is this the best RAG framework?"&lt;/p&gt;

&lt;p&gt;That is too vague to be useful.&lt;/p&gt;

&lt;p&gt;The better comparison is architectural.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Main idea&lt;/th&gt;
&lt;th&gt;Common weakness&lt;/th&gt;
&lt;th&gt;Why this repo differs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Framework-only RAG stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compose your own parser, retriever, vector store, and generator&lt;/td&gt;
&lt;td&gt;High assembly burden; a lot of operational logic is still your job&lt;/td&gt;
&lt;td&gt;This repo packages more of the retrieval, ingestion, attribution, and serving path together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted RAG / SaaS search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest time to first demo&lt;/td&gt;
&lt;td&gt;External data boundary, vendor coupling, recurring service assumptions&lt;/td&gt;
&lt;td&gt;This repo keeps self-hosted and local-first execution as first-class options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector-first DIY pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic retrieval drives everything&lt;/td&gt;
&lt;td&gt;Lexical exactness and attribution often become second-class&lt;/td&gt;
&lt;td&gt;This repo treats hybrid retrieval as the practical default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FLAMEHAVEN FileSearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval + ingestion + serving compressed into one engine&lt;/td&gt;
&lt;td&gt;Less of a blank canvas than a raw framework stack&lt;/td&gt;
&lt;td&gt;Better fit for teams that want a mechanical, deployable search base instead of another assembly project&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the actual niche.&lt;/p&gt;

&lt;p&gt;Not "RAG but louder."&lt;/p&gt;

&lt;p&gt;More like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG with a lower operational tax.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;The RAG field has cooled compared to its peak hype cycle.&lt;/p&gt;

&lt;p&gt;That is not a bad thing.&lt;/p&gt;

&lt;p&gt;It means the novelty premium is lower, and the real questions are clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it be deployed?&lt;/li&gt;
&lt;li&gt;Can it run without a side quest in infrastructure?&lt;/li&gt;
&lt;li&gt;Can it keep data local?&lt;/li&gt;
&lt;li&gt;Can it support both lexical precision and semantic recall?&lt;/li&gt;
&lt;li&gt;Can its retrieval behavior be inspected rather than mythologized?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why a repo like this becomes more interesting now than it would have been in the most hype-saturated phase of the RAG wave.&lt;/p&gt;

&lt;p&gt;When everything is new, wrappers are enough.&lt;/p&gt;

&lt;p&gt;When the field matures, the differentiator becomes whether the system removes real engineering burden.&lt;/p&gt;

&lt;p&gt;This one is at least trying to solve that problem directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is special about the code, specifically
&lt;/h2&gt;

&lt;p&gt;If I had to reduce the repo's technical distinctiveness to a short list, it would be this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25 + RRF is built in&lt;/strong&gt;, not bolted on later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KnowledgeAtom indexing&lt;/strong&gt; gives the system a more precise retrieval unit than document-only search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable chunk URIs&lt;/strong&gt; (&lt;code&gt;local://store/enc_path#c0001&lt;/code&gt;) make attribution less fragile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-pass chunking&lt;/strong&gt; — structure-aware TextChunker + char-window KnowledgeAtom embedding pass — keeps the text pipeline mechanical and inspectable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gravitas Vectorizer v2.0&lt;/strong&gt; reduces startup cost and dependency sprawl (zero torch/transformers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider abstraction&lt;/strong&gt; separates retrieval architecture from model vendor choice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixin segmentation and BackendRegistry pattern&lt;/strong&gt; show a codebase moving away from monolithic orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why this repo feels different from the usual RAG stack.&lt;/p&gt;

&lt;p&gt;Not because it claims magic.&lt;/p&gt;

&lt;p&gt;Because it makes several practical decisions that many RAG repos defer, externalize, or ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest boundary
&lt;/h2&gt;

&lt;p&gt;This is not a claim that the repo solves everything.&lt;/p&gt;

&lt;p&gt;It does not.&lt;/p&gt;

&lt;p&gt;And the codebase itself shows that.&lt;/p&gt;

&lt;p&gt;Static inspection still flags complexity hotspots in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;api.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;admin_routes.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;eval_self.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chronos_grid.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also components that exist in the engine but are not yet connected to the default pipeline — &lt;code&gt;ContextExtractor&lt;/code&gt; being the clearest example. The architecture is there; the wiring is not yet complete everywhere.&lt;/p&gt;

&lt;p&gt;That is actually a good thing for a write-up like this, because it keeps the claim honest.&lt;/p&gt;

&lt;p&gt;The interesting story here is not "perfect codebase."&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a repo with a real architectural point of view, a recognizably lower dependency burden, and code decisions that are meaningfully different from the usual vector-wrapper pattern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is a much stronger claim than vague "enterprise-grade RAG" language.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final take
&lt;/h2&gt;

&lt;p&gt;FLAMEHAVEN FileSearch is interesting because it is not merely trying to make retrieval work.&lt;/p&gt;

&lt;p&gt;It is trying to make retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more mechanical&lt;/li&gt;
&lt;li&gt;more local&lt;/li&gt;
&lt;li&gt;more attributable&lt;/li&gt;
&lt;li&gt;less dependency-heavy&lt;/li&gt;
&lt;li&gt;and less painful to deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a better differentiator than "supports RAG."&lt;/p&gt;

&lt;p&gt;Most repositories do.&lt;/p&gt;

&lt;p&gt;The more important question now is whether they reduce the actual engineering burden around RAG, or just rearrange it.&lt;/p&gt;

&lt;p&gt;This repo is interesting because it appears to reduce some of it in code.&lt;/p&gt;

&lt;p&gt;And in a field where many projects now converge into the same parser + vector store + model + wrapper pattern, that is a difference worth paying attention to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/Flamehaven-Filesearch" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/Flamehaven-Filesearch&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI-SLOP Detector v3.5.0 — Every Claim, Verified Against Source Code</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:19:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</guid>
      <description>&lt;p&gt;I published a LinkedIn post about AI-SLOP Detector's self-calibration system and download numbers. Someone asked the reasonable question: &lt;strong&gt;"Can you actually back that up?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Here's the source.&lt;/p&gt;

&lt;p&gt;This isn't a feature announcement. It's a line-by-line audit of seven claims against the actual codebase. Every VERDICT links to a real file and real line numbers. The repo is public — go check it yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What was claimed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every scan is recorded&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeat scans become calibration signal&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Updates only when signal is strong enough&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visible policy artifact (&lt;code&gt;.slopconfig.yaml&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit numeric limits govern calibration&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detects empty/stub/phantom/disconnected code&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~1.4K downloads last week&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All seven. No fabrications. No inflated numbers. Here's the proof.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 1: "Every scan is recorded"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 116–180&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-invoked on every CLI run. The only opt-out is &lt;code&gt;--no-history&lt;/code&gt;. Each scan writes to SQLite at &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt; and stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deficit_score&lt;/code&gt;, &lt;code&gt;ldr_score&lt;/code&gt;, &lt;code&gt;inflation_score&lt;/code&gt;, &lt;code&gt;ddc_usage_ratio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n_critical_patterns&lt;/code&gt;, &lt;code&gt;fired_rules&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git_commit&lt;/code&gt;, &lt;code&gt;git_branch&lt;/code&gt;, &lt;code&gt;project_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The schema is now at v5 and auto-migrates on startup; migration paths have shipped in every release from v2.9.0 to v3.5.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The record() call is real. The schema is versioned. The behavior is not optional.&lt;/strong&gt;&lt;/p&gt;
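&lt;p&gt;To make the recording contract concrete, here is a minimal, self-contained sketch. The table layout and function signature are illustrative only, not the actual schema in &lt;code&gt;history.py&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

# Minimal stand-in for the history recorder described above.
# Column set is a small subset of the real schema; names are illustrative.
def record_scan(db, file_path, deficit_score, git_commit=None, project_id=None):
    db.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "file_path TEXT, deficit_score REAL, git_commit TEXT, project_id TEXT, "
        "ts TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    db.execute(
        "INSERT INTO history (file_path, deficit_score, git_commit, project_id) "
        "VALUES (?, ?, ?, ?)",
        (file_path, deficit_score, git_commit, project_id),
    )
    db.commit()

db = sqlite3.connect(":memory:")  # in-memory stand-in for ~/.slop-detector/history.db
record_scan(db, "src/app.py", 0.42, git_commit="abc123")
print(db.execute("SELECT COUNT(*) FROM history").fetchone()[0])  # 1
```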




&lt;h2&gt;
  
  
  Claim 2: "Every re-scan becomes signal"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 221–246&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_files_with_multiple_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Only files scanned &amp;gt;= 2 times count as calibration events
&lt;/span&gt;    &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="n"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;HAVING&lt;/span&gt; &lt;span class="nc"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 301–309&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_load_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;by_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_group_runs_by_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single-scan files produce no calibration events. Only repeat scans generate &lt;code&gt;improvement&lt;/code&gt; or &lt;code&gt;fp_candidate&lt;/code&gt; labels. The threshold is hardcoded in SQL, not assumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The repeat-scan requirement is enforced at the query level, not in documentation.&lt;/strong&gt;&lt;/p&gt;
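&lt;p&gt;A runnable toy of that gate, using nothing beyond the standard library (the table shape is illustrative):&lt;/p&gt;

```python
import sqlite3

# The repeat-scan gate as a toy: only files recorded twice or more
# survive the GROUP BY ... HAVING filter quoted above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE history (file_path TEXT, deficit_score REAL)")
db.executemany(
    "INSERT INTO history VALUES (?, ?)",
    [("a.py", 0.6), ("a.py", 0.4), ("b.py", 0.5)],  # a.py scanned twice
)
repeat_files = [
    row[0]
    for row in db.execute(
        "SELECT file_path FROM history GROUP BY file_path HAVING COUNT(*) >= 2"
    )
]
print(repeat_files)  # ['a.py']; b.py yields no calibration event
```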




&lt;h2&gt;
  
  
  Claim 3: "Updates only when the signal is strong enough"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54 (constants) and 251–262 (enforcement)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;   &lt;span class="c1"&gt;# min gap between #1 and #2 candidate
&lt;/span&gt;&lt;span class="n"&gt;MIN_IMPROVEMENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="c1"&gt;# improvement events required
&lt;/span&gt;&lt;span class="n"&gt;MIN_FP_CANDIDATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;# fp_candidate events required
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 1 — confidence gap check (line 251):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence gap &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;lt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Candidates are too close — need more history data for reliable calibration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# NO UPDATE APPLIED
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 2 — score delta check (line 262):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;winner_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# also does not apply
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two independent guards. Both must pass before any weight update applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Ambiguous signal is rejected twice before touching configuration.&lt;/strong&gt;&lt;/p&gt;
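&lt;p&gt;The two gates condense into a dozen lines. This sketch copies the constants above but simplifies the result object to a plain status string:&lt;/p&gt;

```python
CONFIDENCE_GAP = 0.10   # min gap between #1 and #2 candidate
MIN_SCORE_DELTA = 0.02  # min improvement over the current weights

def calibration_status(confidence_gap, current_score, winner_score):
    if confidence_gap < CONFIDENCE_GAP:
        return "insufficient_data"  # Gate 1: candidates too close
    if current_score - winner_score < MIN_SCORE_DELTA:
        return "no_change"          # Gate 2: improvement too small
    return "ok"                     # both gates passed; update applies

print(calibration_status(0.05, 0.50, 0.40))  # insufficient_data
print(calibration_status(0.20, 0.50, 0.49))  # no_change
print(calibration_status(0.20, 0.50, 0.40))  # ok
```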




&lt;h2&gt;
  
  
  Claim 4: "Leaves behind a visible policy every time it changes"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, docstring line 17–18&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Return CalibrationResult&lt;span class="p"&gt;;&lt;/span&gt; optionally write to .slopconfig.yaml via &lt;span class="nt"&gt;--apply-calibration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;--apply-calibration&lt;/code&gt; is passed and &lt;code&gt;status == "ok"&lt;/code&gt;, optimal weights are written to &lt;code&gt;.slopconfig.yaml&lt;/code&gt;. Plain-text YAML. Human-readable. Git-versionable. Every calibration change is a diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The policy artifact is explicit. You can &lt;code&gt;git blame&lt;/code&gt; it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 5: "Explicit limits govern calibration"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;             &lt;span class="c1"&gt;# minimum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;             &lt;span class="c1"&gt;# maximum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_PURITY_WEIGHT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# purity ceiling
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_TOLERANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;  &lt;span class="c1"&gt;# max per-dimension deviation from domain anchor
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_DRIFT_LIMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# warn when optimal weight drifts this far
&lt;/span&gt;&lt;span class="n"&gt;GRID_STEP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;             &lt;span class="c1"&gt;# 0.05 increment resolution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No ML model. No learned bounds. Every constraint is a named constant with a comment explaining why it exists. The calibration space is a bounded grid, not an open optimization landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Every limit is auditable. Nothing is opaque.&lt;/strong&gt;&lt;/p&gt;
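&lt;p&gt;As an illustration of how such bounds behave, here is a hypothetical clamp pass over candidate weights. The function name and the re-normalization step are mine for the sketch; the real calibrator searches a bounded grid instead:&lt;/p&gt;

```python
MIN_W, MAX_W = 0.10, 0.65
MAX_PURITY_WEIGHT = 0.25

def clamp_weights(weights):
    # Clamp each dimension into its allowed band; purity gets a lower ceiling.
    clamped = {}
    for dim, w in weights.items():
        ceiling = MAX_PURITY_WEIGHT if dim == "purity" else MAX_W
        clamped[dim] = min(max(w, MIN_W), ceiling)
    # Re-normalize so the weights still sum to 1.0 (a simplification).
    total = sum(clamped.values())
    return {dim: w / total for dim, w in clamped.items()}

w = clamp_weights({"ldr": 0.80, "inflation": 0.05, "ddc": 0.10, "purity": 0.40})
print(round(sum(w.values()), 6))  # 1.0
```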




&lt;h2&gt;
  
  
  Claim 6: "Detects empty implementations, phantom dependencies, disconnected pipelines"
&lt;/h2&gt;

&lt;p&gt;These are the canonical defect patterns AI code generation produces at scale: the three named in the claim, plus the closely related case of clone clusters. Each has a dedicated module.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defect class&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Empty/stub functions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ldr.py&lt;/code&gt; — LDRCalculator detects &lt;code&gt;pass&lt;/code&gt;, &lt;code&gt;...&lt;/code&gt;, &lt;code&gt;raise NotImplementedError&lt;/code&gt;, &lt;code&gt;TODO&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phantom/unused imports&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/hallucination_deps.py&lt;/code&gt; — AST-based import vs usage analysis via &lt;code&gt;HallucinatedDependency&lt;/code&gt; dataclass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disconnected pipelines&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ddc.py&lt;/code&gt; — DDC (Declared Dependency Completeness) usage ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function clone clusters&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/patterns/python_advanced.py&lt;/code&gt; — Jensen-Shannon Divergence on 30-dim AST histograms, JSD &amp;lt; 0.05 = clone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The clone detection is worth noting. JSD on AST histograms catches structural duplication that string similarity misses entirely. LLMs produce a lot of this — same function logic, slightly renamed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Each defect class has a named module with a working implementation.&lt;/strong&gt;&lt;/p&gt;
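&lt;p&gt;To show why the JSD approach works, here is a self-contained sketch. Counting AST node types stands in for the real 30-dimension histogram, so treat the numbers as illustrative:&lt;/p&gt;

```python
import ast
import math
from collections import Counter

def ast_histogram(source):
    # Count AST node types; a rough stand-in for the 30-dim histogram.
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def jsd(p, q):
    # Jensen-Shannon divergence between two normalized histograms.
    keys = set(p) | set(q)
    ps, qs = sum(p.values()), sum(q.values())
    P = {k: p.get(k, 0) / ps for k in keys}
    Q = {k: q.get(k, 0) / qs for k in keys}
    M = {k: (P[k] + Q[k]) / 2 for k in keys}
    kl = lambda A: sum(A[k] * math.log2(A[k] / M[k]) for k in keys if A[k])
    return (kl(P) + kl(Q)) / 2

# Same logic, every identifier renamed: string similarity is fooled,
# the AST histograms are identical.
clone_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
clone_b = "def acc(items):\n    r = 0\n    for i in items:\n        r += i\n    return r"
print(jsd(ast_histogram(clone_a), ast_histogram(clone_b)) < 0.05)  # True: flagged as clone
```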




&lt;h2&gt;
  
  
  Claim 7: "~1.4K downloads in the past week"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: pypistats.org API (&lt;code&gt;mirrors=false&lt;/code&gt;), queried 2026-04-15&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;last_week&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;1,407  (mirrors excluded — actual pip install traffic)&lt;/span&gt;
&lt;span class="na"&gt;last_month&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1,787&lt;/span&gt;
&lt;span class="na"&gt;last_day&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"~1.4K" is within 0.5% of 1,407. Mirrors excluded means bot traffic is stripped — these are real install invocations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Verified against pypistats in real time. The number is not rounded up.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this format exists
&lt;/h2&gt;

&lt;p&gt;Most open-source project posts make claims. Few back them up with file paths and line numbers.&lt;/p&gt;

&lt;p&gt;That gap is the same problem AI-SLOP Detector is built to close. AI-generated code makes claims too — functions that look complete, imports that look used, pipelines that look connected. Static analysis finds the gap between what the code says and what it does.&lt;/p&gt;

&lt;p&gt;This post applies the same standard to the project's own marketing copy. If a claim can be verified, it should be. If it can't, it shouldn't be made.&lt;/p&gt;

&lt;p&gt;The codebase is public: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;github.com/flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pull requests welcome. Audits welcome more.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verified by static code analysis + pypistats API, 2026-04-15&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitools</category>
      <category>opensource</category>
      <category>codequality</category>
      <category>python</category>
    </item>
    <item>
      <title>It Gets Smarter Every Scan: AI-SLOP Detector v3.5.0 and the Self-Calibration Loop</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:15:49 +0000</pubDate>
      <link>https://dev.to/flamehaven01/it-gets-smarter-every-scan-ai-slop-detector-v350-and-the-self-calibration-loop-3fia</link>
      <guid>https://dev.to/flamehaven01/it-gets-smarter-every-scan-ai-slop-detector-v350-and-the-self-calibration-loop-3fia</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ombaq79ho65mgbtjqyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ombaq79ho65mgbtjqyg.png" alt="cover" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; 🔻&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n"&gt;v3.1.0 — Three Formula Refinements and the Adversarial Tester That Found Them&lt;/a&gt; · &lt;br&gt;
🔻&lt;a href="https://dev.to/flamehaven01/the-tool-that-turned-on-itself-ai-slop-detector-v290-v291-3oc4"&gt;v2.9.0/v2.9.1 — The Tool That Turned On Itself&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By late 2025, everyone was building with AI. A weekend was enough to launch a SaaS app, and by Monday it was already on Product Hunt. The code looked finished, the UI worked, and the demo landed. That was also the problem.&lt;/p&gt;

&lt;p&gt;In 2026, some of the consequences started arriving in public. Exposed databases, weak security boundaries, brittle automation, and production systems that looked polished enough to ship but had clearly not been understood at the level their surface confidence implied. Not every one of those failures belongs to static analysis, and it would be too easy to pretend otherwise. But many of them still point to the same upstream condition: code that looks complete long before it deserves trust.&lt;/p&gt;

&lt;p&gt;That is the layer this release is about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The breach is the headline. The review gap is the story.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ec93in1avitflyntlvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ec93in1avitflyntlvp.png" alt="Structurally plausible, functionally thin" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A missing security rule is not the same thing as a stubbed auth function. A runtime-only bug is not the same thing as a phantom import. A broken architecture is not the same thing as a buzzword-heavy helper. These are different failure classes, and any serious tool has to respect that difference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag337gfzcfw1zdjiancq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag337gfzcfw1zdjiancq.png" alt="output scales while oversight stagnates" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What they often share, though, is the review environment that let them through. AI increased output volume, speed, and surface polish. Review depth did not increase with it. That matters because AI-generated code has a very recognizable habit: it often looks complete before it is complete.&lt;/p&gt;

&lt;p&gt;It compiles. It passes tests. It sounds like it knows what it is doing. Then you open the function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_quality_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Advanced multi-dimensional quality assessment using
    proprietary algorithms with statistical normalization,
    entropy-based weighting, and dynamic threshold calibration.
    Returns a score between 0 and 100.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# TODO: implement the actual algorithm
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;85.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not noisy code. It is confident emptiness. In an analytics path, it becomes false certainty. In a payment path, it becomes a defect. In an auth path, it becomes risk. &lt;/p&gt;

&lt;p&gt;The issue is not that AI writes ugly code. The issue is that AI reliably produces code that is structurally plausible while functionally thin.&lt;/p&gt;

&lt;p&gt;That is a narrower claim than “AI is dangerous,” but it is also far more useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  We ran into this ourselves
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc91ne3b4usfndqmn5p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc91ne3b4usfndqmn5p0.png" alt="4-dimensional weighted geometric mean" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This did not begin as a theory about other people’s repos. It began when we found a flaw in our own scoring model. Back in v2.8.0, we discovered that our formula was accidentally rewarding spaghetti code: a large god function could sometimes look healthier than a small clean function, because complexity was dividing the penalty instead of amplifying it.&lt;/p&gt;

&lt;p&gt;That was backwards, so the math changed.&lt;/p&gt;

&lt;p&gt;AI-SLOP Detector now evaluates four dimensions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LDR&lt;/strong&gt; for logic density
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflation&lt;/strong&gt; for jargon density relative to real logic
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDC&lt;/strong&gt; for dependency usage rather than dependency presence
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purity&lt;/strong&gt; for critical structural defects that should drag the whole score down
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are combined with a &lt;strong&gt;weighted geometric mean&lt;/strong&gt;, not an arithmetic average.&lt;/p&gt;

&lt;p&gt;Why that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one strong-looking axis should not be able to hide a collapsed one
&lt;/li&gt;
&lt;li&gt;a polished docstring should not rescue empty logic
&lt;/li&gt;
&lt;li&gt;if one important dimension fails, the whole score should feel it&lt;/li&gt;
&lt;/ul&gt;
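&lt;p&gt;The difference is easy to see in numbers. A toy comparison, with illustrative weights rather than the shipped defaults:&lt;/p&gt;

```python
import math

WEIGHTS = {"ldr": 0.40, "inflation": 0.20, "ddc": 0.25, "purity": 0.15}

def arithmetic(scores):
    return sum(scores[d] * w for d, w in WEIGHTS.items())

def geometric(scores):
    # Weighted geometric mean: a collapsed axis drags the product down.
    return math.prod(scores[d] ** w for d, w in WEIGHTS.items())

# Logic density has collapsed; everything else looks great.
scores = {"ldr": 0.05, "inflation": 0.95, "ddc": 0.95, "purity": 0.95}
print(round(arithmetic(scores), 2))  # 0.59, the failure is averaged away
print(round(geometric(scores), 2))   # 0.29, the failure is felt
```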

&lt;p&gt;That is the scoring philosophy underneath the tool. But even that was not enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Static analyzers have a threshold problem
&lt;/h2&gt;

&lt;p&gt;Take a perfectly legitimate ML helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_training_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;raw_samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PreTrainedTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Tokenize and pad samples for transformer training.
    Handles attention mask generation and HuggingFace
    tokenizer conventions for batch encoding.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is nothing wrong with this code. But a generic detector may still overreact, because terms like &lt;code&gt;tokenizer&lt;/code&gt;, &lt;code&gt;attention mask&lt;/code&gt;, and &lt;code&gt;HuggingFace&lt;/code&gt; can look suspicious if the analyzer does not understand the domain it is scanning. In a real ML codebase, those terms are normal. In a CRUD backend, some of them may be genuine anomaly signals.&lt;/p&gt;

&lt;p&gt;That is the threshold problem. The same threshold can be wrong in one codebase and exactly right in another. A universal threshold sounds elegant, but real repositories are local. They have habits, idioms, and boilerplate that are legitimate inside one domain and suspicious inside another.&lt;/p&gt;

&lt;p&gt;So the next problem became obvious: the tool had to learn the project it was scanning. That is the real center of v3.5.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AI-SLOP Detector actually does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1e70mgohgkb3xwldop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1e70mgohgkb3xwldop.png" alt="scanning for structrural integrity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP Detector&lt;/a&gt; is a static analyzer built to catch a specific defect class that shows up repeatedly in AI-generated code: unimplemented stubs, disconnected pipelines, phantom imports, clone-shaped emptiness, placeholder-heavy production paths, and jargon inflation that outruns the actual logic. It is not a style linter, not a full security scanner, and not a runtime verifier. It is a detector for &lt;strong&gt;structural hollowness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That distinction matters because it keeps the claim honest. The tool is not trying to solve every production risk. It is trying to catch one layer that becomes more expensive as AI output scales faster than human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector
slop-detector &lt;span class="nt"&gt;--init&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The workflow is the product story
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ilk1rr0i2hozydca4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ilk1rr0i2hozydca4o.png" alt="why universal rules fail real repositories" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this release interesting is that it is not just “more patterns” or “more language support.” It is a workflow story.&lt;/p&gt;

&lt;p&gt;The detector now has a real loop. It scans the file, classifies its role, computes the 4D score, applies structural pattern penalties, and writes the result to history. Then, once enough repeated scans exist, it revisits that history, extracts behavioral signals, tunes the weights inside bounded domain-aware limits, updates the configuration, and keeps scanning. That is the release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fq4x5gkdfp5mwe5nfcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fq4x5gkdfp5mwe5nfcl.png" alt="mermaid1" width="800" height="1345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That final stretch is what changed this from “detector upgrade” into “adaptive detector.” The tool no longer only evaluates code. It also learns from what happens after evaluation.&lt;/p&gt;
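&lt;p&gt;As a rough sketch of that loop (every name below is illustrative, not the detector's real API), one pass looks like this:&lt;/p&gt;

```python
# Illustrative sketch of a single scan pass; the role classifier,
# weights, and penalty values are invented for illustration.
from dataclasses import dataclass

@dataclass
class ScanResult:
    path: str
    role: str          # e.g. "production", "other"
    score_4d: float    # 0.0 (hollow) .. 1.0 (substantive)

def scan_file(path: str, weights: dict) -> ScanResult:
    # 1. classify the file's role
    role = "production" if path.endswith(".py") else "other"
    # 2. compute a role-aware baseline from the current weights
    base = weights.get(role, 0.5)
    # 3. apply a structural pattern penalty (toy heuristic here)
    penalty = 0.2 if "stub" in path else 0.0
    return ScanResult(path, role, max(0.0, base - penalty))

# 4. every result is appended to history, the surface later tuning reads
history = []
weights = {"production": 0.9, "other": 0.5}
for path in ["api.py", "stub_handler.py"]:
    history.append(scan_file(path, weights))
```

The point of the sketch is the shape, not the numbers: score, record, and only later revisit the record.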




&lt;h2&gt;
  
  
  Self-calibration is the real headline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql1dkuxvotgmm6bnzc4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql1dkuxvotgmm6bnzc4x.png" alt="Mechanical self-calibration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every scan is recorded to a local SQLite history database. That history is not just there for reporting. It becomes the signal surface for the next tuning step. Once enough repeated scans accumulate, the detector begins asking a simple question: when this file was flagged, what happened next?&lt;/p&gt;

&lt;p&gt;That produces two behavior-derived event types. An &lt;strong&gt;improvement event&lt;/strong&gt; means the file was flagged, later changed, and its deficit dropped meaningfully. A &lt;strong&gt;false-positive candidate&lt;/strong&gt; means the file was flagged, then scanned again with the same content and little meaningful score movement.&lt;/p&gt;

&lt;p&gt;That difference is more important than it sounds. A lot of “self-improving” systems quietly learn from their own outputs. They mark something suspicious, then later use that same judgment as the truth signal for tuning. The system becomes better at agreeing with itself. That is not calibration. That is self-imitation with cleaner packaging.&lt;/p&gt;

&lt;p&gt;v3.5.0 tries to avoid that trap. Its labels are not taken from the scoring formula. They are inferred from developer behavior around repeated scans. The formula says, “this looks suspicious.” The next run reveals whether a human treated that suspicion as real.&lt;/p&gt;

&lt;p&gt;That signal is not perfect. An unchanged file is not always a false positive. It may be legacy code, low priority, or simply out of scope. But it is still a healthier signal than teaching the formula to imitate its own prior outputs.&lt;/p&gt;
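&lt;p&gt;The two event types can be sketched as a comparison between consecutive scans of the same file. The thresholds and field names here are assumptions for illustration, not the shipped defaults:&lt;/p&gt;

```python
# Sketch of behavior-derived event extraction; thresholds and field
# names are illustrative assumptions.
def classify_event(prev: dict, curr: dict, drop_threshold: float = 0.15) -> str:
    """Compare two scans of the same file and label what happened."""
    if prev["content_hash"] != curr["content_hash"]:
        # file was changed after being flagged
        if prev["deficit"] - curr["deficit"] >= drop_threshold:
            return "improvement"  # deficit dropped meaningfully
        return "inconclusive"
    # file was rescanned with identical content
    if abs(prev["deficit"] - curr["deficit"]) < 0.05:
        return "false_positive_candidate"  # flagged, untouched, score static
    return "inconclusive"  # e.g. weights changed between runs
```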




&lt;h2&gt;
  
  
  What the loop actually looks like
&lt;/h2&gt;

&lt;p&gt;The loop is not mystical. It is mechanical. Repeated scans accumulate, improvement and likely-FP events are extracted, candidate weight sets are evaluated, the search is bounded around the project’s current domain anchor, and if a strong enough winner appears, the config gets updated. If a calibrated weight drifts too far from the domain anchor, the system emits a warning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfpv792zuufs9nm0czgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfpv792zuufs9nm0czgz.png" alt="mermaid2" width="800" height="2640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what makes the title true. It gets smarter every scan, not because a hidden model is hallucinating taste, but because repeated use creates a bounded feedback loop. That is much less magical, and much more trustworthy.&lt;/p&gt;
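&lt;p&gt;The bounded step is the part worth pinning down. A minimal sketch of anchor-bounded calibration, with an invented band width and warning, might look like this:&lt;/p&gt;

```python
# Sketch of anchor-bounded weight calibration; the band width and
# clamp-plus-warn behavior are illustrative, not the real implementation.
import warnings

def calibrate(anchor: float, proposed: float, max_drift: float = 0.25) -> float:
    """Keep a candidate weight inside a band around the domain anchor."""
    lo, hi = anchor - max_drift, anchor + max_drift
    clamped = min(max(proposed, lo), hi)
    if clamped != proposed:
        # drift past the anchor band is surfaced, not silently accepted
        warnings.warn(f"weight {proposed} outside anchor band; clamped to {clamped:.2f}")
    return clamped
```

The design choice matters more than the numbers: the search can move, but it cannot wander arbitrarily far from what the domain anchor says is plausible.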




&lt;h2&gt;
  
  
  Why &lt;code&gt;--init&lt;/code&gt; matters more now
&lt;/h2&gt;

&lt;p&gt;There is another reason the calibration story works better in v3.5.0. The detector no longer starts from a generic nowhere. &lt;code&gt;--init&lt;/code&gt; now performs domain-aware bootstrap, detects the likely project type, and seeds the starting weights accordingly. That means calibration starts near the right neighborhood instead of wandering across the whole map.&lt;/p&gt;

&lt;p&gt;That improves the first week of use, not just the tenth. And that matters, because bad first impressions kill adaptive tools. If the detector is only smart after a month of annoying you, it will never survive long enough to get smart.&lt;/p&gt;

&lt;p&gt;Good initialization is not a convenience feature. It is part of whether the loop can gather clean signal at all.&lt;/p&gt;
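&lt;p&gt;A minimal sketch of what domain-aware bootstrap means in practice. The marker files and seed weights below are invented for illustration; the real detector's heuristics and defaults will differ:&lt;/p&gt;

```python
# Sketch of domain-aware seeding; marker files and weights are
# hypothetical, not the shipped configuration.
DOMAIN_SEEDS = {
    "web_api": {"completeness": 0.35, "connectivity": 0.30},
    "ml":      {"completeness": 0.25, "connectivity": 0.20},
    "generic": {"completeness": 0.30, "connectivity": 0.25},
}

def detect_domain(filenames: set) -> str:
    """Guess the project type from marker files at the repo root."""
    if {"requirements.txt", "model.py"} <= filenames:
        return "ml"
    if "app.py" in filenames or "main.go" in filenames:
        return "web_api"
    return "generic"

def seed_weights(filenames: set) -> dict:
    """Start calibration near the right neighborhood, not from nowhere."""
    return dict(DOMAIN_SEEDS[detect_domain(filenames)])
```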




&lt;h2&gt;
  
  
  JS, TS, and Go are not side quests
&lt;/h2&gt;

&lt;p&gt;v3.5.0 also expands analysis coverage to Go, JS, JSX, TS, and TSX. That is useful on its own, but the deeper significance is architectural. Structurally hollow AI-generated code is not a Python-only phenomenon. If the detector’s long-term direction is project-local calibration rather than one-size-fits-all scoring, then wider language support is not a side feature. It is the natural expansion of the same idea.&lt;/p&gt;

&lt;p&gt;Different languages. Same review gap. Same loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest boundary
&lt;/h2&gt;

&lt;p&gt;This tool still does &lt;strong&gt;not&lt;/strong&gt; close every gap. It will not fix missing infrastructure controls, catch every runtime bug, prove the architecture is correct, or replace security review. A clean structural profile is not proof of safety.&lt;/p&gt;

&lt;p&gt;What it can do is narrow one expensive blind spot: the distance between code that &lt;strong&gt;looks finished&lt;/strong&gt; and code that carries enough actual logic to deserve confidence. That is a smaller claim than “AI risk solved,” but it is also the kind of claim that survives production better.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;AI has made software generation dramatically cheaper. It has not made understanding cheaper. That difference is where governance debt begins to accumulate.&lt;/p&gt;

&lt;p&gt;If teams can now generate far more code than they can truly review, then the review stack needs tools that operate below style and above syntax. Not tools that ask whether the code is pretty, but tools that ask whether the implementation carries enough substance for the confidence wrapped around it.&lt;/p&gt;

&lt;p&gt;That is the space AI-SLOP Detector is trying to occupy. Not the whole problem. Just one layer that became impossible to ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project/
slop-detector &lt;span class="nt"&gt;--init&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix what clearly deserves fixing. Leave legitimate idioms alone. Then keep scanning. If the loop is doing its job, the next pass should know your codebase a little better than the first one did.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub: flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtool</category>
      <category>architecture</category>
      <category>ai</category>
      <category>governance</category>
    </item>
    <item>
      <title>Can AI Review Physics? Yes — That Is Why We Built SPAR</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Sun, 12 Apr 2026 09:34:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/can-ai-review-physics-yes-that-is-why-we-built-spar-1ojk</link>
      <guid>https://dev.to/flamehaven01/can-ai-review-physics-yes-that-is-why-we-built-spar-1ojk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A standalone review framework for checking whether outputs deserve the claims attached to them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most review systems answer a familiar question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did the system still produce the expected output?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SPAR is built for a narrower and more dangerous one:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does the output still deserve the claim attached to it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is the split. Not reliability alone, but &lt;strong&gt;admissibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practical terms, admissibility means &lt;strong&gt;claim-worthiness&lt;/strong&gt;: whether a result justifies the interpretation, governance status, or scientific statement built on top of it.&lt;/p&gt;

&lt;p&gt;A system can be reliable and inadmissible at the same time.&lt;/p&gt;

&lt;p&gt;A physics engine can compute &lt;code&gt;beta_G_norm&lt;/code&gt;, return zero, pass regression, and stay green across the whole pipeline. The report can still say the background is admissible. But if the function producing &lt;code&gt;beta_G_norm&lt;/code&gt; is a stub that always returns zero, the output is stable while the claim attached to it is false.&lt;/p&gt;

&lt;p&gt;That is not hypothetical. It is one concrete class of review failure SPAR was designed to surface.&lt;/p&gt;
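&lt;p&gt;A stripped-down illustration of that failure class (the function body here is deliberately a stub; it is not SPAR or Flamehaven-TOE code):&lt;/p&gt;

```python
# Minimal illustration of "reliable but inadmissible": a stub that
# always returns zero passes every output check.
def beta_G_norm(state):
    return 0.0  # stub: identical output regardless of the background

# Ordinary regression: stable and green on every run.
assert beta_G_norm({"background": "flat"}) == 0.0
assert beta_G_norm({"background": "curved"}) == 0.0  # should NOT vanish here

# The output is perfectly stable. The claim "this background is
# admissible" built on the second call is still false, because the
# producing path is a stub, not a computation.
```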




&lt;h2&gt;
  
  
  What SPAR Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F140n2kz9fwvslapkmc6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F140n2kz9fwvslapkmc6m.png" alt="What SPAR Is" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPAR (Sovereign Physics Autonomous Review)&lt;/strong&gt; is a deterministic framework for &lt;strong&gt;claim-aware review&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does not replace unit tests. It does not replace regression benchmarks. It does not replace scoring systems. It reviews a different object:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the output&lt;/li&gt;
&lt;li&gt;the claim attached to that output&lt;/li&gt;
&lt;li&gt;the implementation state behind it&lt;/li&gt;
&lt;li&gt;the maturity state that should travel with it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SPAR started inside Flamehaven-TOE, an open physics simulation and AI governance engine. It has since been extracted into a &lt;strong&gt;standalone open-source framework&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/flamehaven01/SPAR-Framework" rel="noopener noreferrer"&gt;github.com/flamehaven01/SPAR-Framework&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The framework includes a generic review kernel, explicit score and verdict policy, registry-backed review surfaces, and a first domain adapter for physics. Physics is the first adapter and the first domain where this review model was stress-tested. It is not the limit of the framework.&lt;/p&gt;

&lt;p&gt;The core idea is simpler than the name: &lt;strong&gt;an output can pass while the claim attached to it drifts.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Ordinary Review Is Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke30jyn0seb4vvlwpp0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke30jyn0seb4vvlwpp0i.png" alt="Reliability ≠ Truth" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ordinary review is usually shallow by necessity. It asks questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did the code run&lt;/li&gt;
&lt;li&gt;did the output shape stay valid&lt;/li&gt;
&lt;li&gt;did the score remain within bounds&lt;/li&gt;
&lt;li&gt;did regression remain green&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are necessary checks. They are not always enough.&lt;/p&gt;

&lt;p&gt;The failure SPAR cares about is not, in the first instance, a crash. It is not even always a wrong number. It is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the code executes&lt;/li&gt;
&lt;li&gt;the output looks plausible&lt;/li&gt;
&lt;li&gt;the tests pass&lt;/li&gt;
&lt;li&gt;and the interpretation is still overstated or structurally false&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That failure can appear in several ways: a placeholder implementation returns stable-looking values; a maturity registry stays stale after the implementation improves; a score looks smooth before its epistemic basis is strong enough to justify the interpretation attached to it; an approximation gets reported as closure.&lt;/p&gt;

&lt;p&gt;None of these failures is spectacular. That is exactly why they are easy to miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Minimal Divergence
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe774mdtlkafs2zktlob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe774mdtlkafs2zktlob.png" alt="Ordinary Review vs SPAR" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The clearest way to see the difference is in review form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ordinary_regression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kernel_exec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_suite_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GREEN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spar_review"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONSISTENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONSISTENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GAP_STATE_MISMATCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implementation path is genuine; registry classification remains stale"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required_action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RECLASSIFY"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ordinary regression says the system still works. SPAR says the system may no longer be describing its own computation truthfully.&lt;/p&gt;

&lt;p&gt;SPAR is not "tests, but harsher." It can produce a &lt;strong&gt;different review outcome&lt;/strong&gt; even when ordinary regression remains green. In this case, the required action is not rejection. It is reclassification. That is not the same as testing harder. It is reviewing a different object.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mismatch Classes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bx0i61pzmsgs7ad123q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bx0i61pzmsgs7ad123q.png" alt="What We Catch" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPAR treats three mismatch classes as first-class review objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor mismatch&lt;/strong&gt; — the output conflicts with a declared analytical or contractual anchor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpretation mismatch&lt;/strong&gt; — the report language claims more than the implementation state justifies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maturity mismatch&lt;/strong&gt; — the implementation, registry, and outward-facing claim have drifted out of sync.&lt;/p&gt;

&lt;p&gt;Ordinary review mostly checks whether a system still passes. SPAR checks whether the result is still being described honestly.&lt;/p&gt;
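&lt;p&gt;The three classes can be sketched as explicit checks over the states a result travels with. Everything here is an illustrative simplification, not SPAR's actual data model:&lt;/p&gt;

```python
# Sketch mapping the three mismatch classes to checks; the state
# vocabulary and mapping are invented for illustration.
def expected_maturity(impl_state: str) -> str:
    """What the registry should say, given the implementation state."""
    return {"genuine": "closed", "approximate": "partial"}.get(impl_state, "open")

def find_mismatches(anchor_ok: bool, claim: str,
                    impl_state: str, registry_state: str) -> list:
    found = []
    if not anchor_ok:
        found.append("anchor_mismatch")          # output vs declared anchor
    if claim == "closed" and impl_state != "genuine":
        found.append("interpretation_mismatch")  # language outruns implementation
    if expected_maturity(impl_state) != registry_state:
        found.append("maturity_mismatch")        # registry drifted out of sync
    return found
```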




&lt;h2&gt;
  
  
  The Three-Layer Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6awaygug9q9gpiivqu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6awaygug9q9gpiivqu4.png" alt="Deterministic, Not LLM-Judged" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer A — Anchor Consistency
&lt;/h3&gt;

&lt;p&gt;Layer A checks whether output agrees with a declared analytical or contractual anchor. The expected value is not "whatever the engine produced last time." It is "what the declared contract says must appear for this background, under this formulation."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Flat Minkowski: beta residuals must vanish
&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CONSISTENT&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;beta_G&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;ANOMALY&lt;/span&gt;

&lt;span class="c1"&gt;# de Sitter: admissibility gate must FAIL
# A PASS here indicates a gate defect, not a success.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Layer A tests agreement with a declared reference contract — not truth in some unconstrained universal sense. Analytical anchors depend on regime, normalization, and formulation. That distinction matters. Still, the engineering value is clear: reliability can remain intact while anchor-consistency fails. A Layer A anomaly means the output contradicts the contract the system claims to be using.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer B — Interpretation Validity
&lt;/h3&gt;

&lt;p&gt;Layer B checks whether the interpretation attached to the output stays within declared scope. This layer is &lt;strong&gt;deterministic&lt;/strong&gt; — it does not rely on a free-form LLM judge. It uses explicit rule tables over structured runtime artifacts, maturity states, and report text.&lt;/p&gt;

&lt;p&gt;Typical checks: does the report claim full closure while the path is still heuristic or partial; is a bounded approximation being described as exact; is an environment-conditional bridge being written up as universal; are overclaim phrases appearing where runtime state does not support them.&lt;/p&gt;

&lt;p&gt;Layer B does not eliminate semantic ambiguity. What it does is narrow the problem from "solve rhetoric in general" to "enforce explicit admissibility contracts against declared model states." That makes it auditable. Not complete. Auditable.&lt;/p&gt;
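&lt;p&gt;One such rule can be sketched as a table of overclaim phrases keyed to the maturity states that would license them. The phrases and states below are illustrative, not the framework's real rule set:&lt;/p&gt;

```python
# Sketch of a deterministic Layer B rule: certain phrases are only
# admissible at certain maturity states. Rule table is illustrative.
OVERCLAIM_RULES = {
    "full closure": {"closed"},
    "exact":        {"closed"},
    "universal":    {"closed"},
}

def check_interpretation(report_text: str, maturity: str) -> list:
    """Return the phrases whose required maturity state is not met."""
    text = report_text.lower()
    return [phrase for phrase, allowed in OVERCLAIM_RULES.items()
            if phrase in text and maturity not in allowed]
```

No model judges tone here; the same report and the same state always produce the same violations, which is what makes the check auditable.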

&lt;h3&gt;
  
  
  Layer C — Existence and Maturity Probes
&lt;/h3&gt;

&lt;p&gt;Layer C asks what kind of implementation produced the result: genuine, approximate, gapped, environment-conditional, or research-only.&lt;/p&gt;

&lt;p&gt;This is where SPAR becomes especially different from ordinary review. It does not merely score outputs. It checks the &lt;strong&gt;ontological status&lt;/strong&gt; of the path that produced them. A result from a known-limited path is not the same thing as a result from a genuine path. A research probe is not production-grade closure. A dependency-bound bridge is not a universal capability. Those distinctions change what the output is allowed to claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Registry Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtrpjd5figmbutwl658a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtrpjd5figmbutwl658a.png" alt="Machine-Readable Governance" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
A framework like this needs more than score outputs. It needs structured state that can travel with the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ARCHITECTURE_GAPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C4_sidrce_omega&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIDRCE Omega (v4.5+): chi-squared Gaussian model. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score = exp(-chi2/2), chi2 = (||beta||/tol)^2. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Derived from zero-centered Gaussian likelihood. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GAP: tolerance values are calibration parameters with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no first-principles derivation of their exact magnitude.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C8_rg_linearized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RG flow: linearized dilaton ODE only. Metric does NOT flow. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROXIMATION: valid only for small perturbations &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;around a fixed background geometry.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A machine-readable registry turns caveat prose into runtime surface. That lets review results carry explicit maturity labels — &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;, &lt;code&gt;environment_conditional&lt;/code&gt;, &lt;code&gt;research_only&lt;/code&gt; — rather than vague prose. Without that surface, approximation and closure collapse into the same sentence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scoring Policy Is Explicit
&lt;/h2&gt;

&lt;p&gt;SPAR keeps score policy visible. No hidden learned weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SCORE_TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANOMALY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# contradicts an analytical anchor
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# review-layer failure
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GAP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# honest gap disclosure
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# bounded concern
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROXIMATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# known simplification
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;journal_verdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer_a&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;count_anomalies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MINOR REVISION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAJOR REVISION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are not laws of nature. They are review policy. A hidden learned scorer may feel sophisticated. An explicit policy is easier to inspect, debate, and change.&lt;/p&gt;

&lt;p&gt;Two or more Layer A anomalies trigger unconditional REJECT regardless of total score. Mathematical contract failures are not averaged away by cleaner signals elsewhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Concrete Example: The Omega Score Transition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh145cyc50hcjmaxkk4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh145cyc50hcjmaxkk4c.png" alt="The Omega Score Transition" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flamehaven-TOE's primary governance metric, SIDRCE Omega, once relied on a large arbitrary multiplicative constant applied to the raw residual. The outputs looked stable. Nothing felt obviously broken.&lt;/p&gt;

&lt;p&gt;SPAR still flagged it. Stability was not the right question. The stronger question was whether the formula justified the interpretation being attached to it. A free scaling constant with no physical derivation is not the same thing as a physically motivated model.&lt;/p&gt;

&lt;p&gt;The formula was replaced with a chi-squared Gaussian construction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gs_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_G_norm&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_G&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b_score&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_B_norm&lt;/span&gt;  &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;si_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_Phi_norm&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_Phi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;omega_physics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;si_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That change matters because it introduces a reversible relation to the underlying residual:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;||beta|| = tol * sqrt(-2 * ln(score))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given a reported score, the residual is recoverable. The score is no longer just a presentation layer. It encodes a falsifiable relation to the quantity beneath it.&lt;/p&gt;
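&lt;p&gt;The round trip is easy to check numerically. This sketch recomputes one Gaussian component and inverts it; the tolerance value is illustrative, not the shipped calibration:&lt;/p&gt;

```python
import math

def omega_component(beta_norm: float, tol: float) -> float:
    # Forward: chi-squared Gaussian score from a normalized residual
    return math.exp(-0.5 * (beta_norm / tol) ** 2)

def residual_from_score(score: float, tol: float) -> float:
    # Inverse: recover the residual magnitude from a reported score
    return tol * math.sqrt(-2.0 * math.log(score))

tol_G = 0.05   # illustrative tolerance scale
beta = 0.03
score = omega_component(beta, tol_G)
recovered = residual_from_score(score, tol_G)
print(score, recovered)  # recovered matches beta to float precision
```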

&lt;p&gt;SPAR did not respond by declaring the problem solved. It updated the classification precisely: the formula ceased to be arbitrary, but the remaining gap shifted to the tolerance scales, which are still calibration parameters. That is a narrower and more honest claim than either "arbitrary" or "fully resolved."&lt;/p&gt;

&lt;p&gt;That is exactly the kind of distinction SPAR is built to review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; .[dev]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spar_framework.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_review&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spar_domain_physics.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_review_runtime&lt;/span&gt;

&lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_review_runtime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_G_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_B_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_Phi_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sidrce_omega&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eft_m_kk_gev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0e16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ricci_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flat minkowski&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;report_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bounded report text.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_registry_snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The review result carries more than pass/fail: a verdict, a score, and a maturity-aware review surface. That surface is what makes the output governable rather than just evaluable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Framework Fits
&lt;/h2&gt;

&lt;p&gt;Physics remains the first adapter and strongest early testbed. The review pattern is broader.&lt;/p&gt;

&lt;p&gt;It fits anywhere outputs can pass while the attached claim can drift: scientific computing pipelines, PDE and simulation workflows, scientific ML surrogates, inverse and calibration models, AI code review, model governance, regulated analytics and reporting.&lt;/p&gt;

&lt;p&gt;That does &lt;strong&gt;not&lt;/strong&gt; mean every team needs the full framework. Often the first useful step is smaller.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Lightweight Adoption Path
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkia6qy78ou1b90gohdut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkia6qy78ou1b90gohdut.png" alt="Pragmatic Adoption Path" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Claim Check.&lt;/strong&gt; Add three explicit questions to an existing workflow: What is the output actually claiming? Does that claim match the implementation state? Is this result exact, approximate, partial, or heuristic? Most teams can do this immediately with no new tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Maturity Labels.&lt;/strong&gt; Attach state labels to results: &lt;code&gt;heuristic&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;, &lt;code&gt;environment_conditional&lt;/code&gt;. A small registry. Already a meaningful step beyond ordinary review.&lt;/p&gt;
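&lt;p&gt;A Level 2 registry does not need infrastructure. The sketch below is one possible shape using the four labels above; the helper and result names are hypothetical:&lt;/p&gt;

```python
MATURITY_STATES = {"heuristic", "partial", "closed", "environment_conditional"}

registry = {}  # result name -> maturity label

def label(result_name: str, state: str) -> None:
    # Reject labels outside the agreed vocabulary
    if state not in MATURITY_STATES:
        raise ValueError(f"unknown maturity state: {state}")
    registry[result_name] = state

label("omega_score", "partial")
label("anchor_consistency", "closed")
print(registry)
```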

&lt;p&gt;&lt;strong&gt;Level 3 — Full SPAR.&lt;/strong&gt; Layer A anchor consistency, Layer B interpretation validity, Layer C existence and maturity probes, registry-backed snapshots, explicit score and verdict policy.&lt;/p&gt;

&lt;p&gt;SPAR can be used as a review habit before it is adopted as a full framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  What SPAR Does Not Do
&lt;/h2&gt;

&lt;p&gt;SPAR does not provide a universal truth engine, free-form LLM judging in the core, domain contracts inside the generic kernel, or certainty about whether a scientific claim is true in all possible senses.&lt;/p&gt;

&lt;p&gt;SPAR is not a machine for declaring truth. Its narrower goal is to make &lt;strong&gt;claim drift reviewable&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reliability Is Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk32aoci0joww2uhbcqyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk32aoci0joww2uhbcqyz.png" alt="Make Claim Drift Reviewable" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reliability asks whether a system produces stable, repeatable outputs. Admissibility asks whether those outputs deserve the meanings attached to them.&lt;/p&gt;

&lt;p&gt;A stub that always returns zero can be reliable. A heuristic threshold can be reliable. A smoothly calibrated score can be reliable. None of those facts alone makes the resulting claim justified.&lt;/p&gt;

&lt;p&gt;Current AI and scientific tooling are already better at measuring reliability than admissibility. That asymmetry is understandable — reliability is easier to benchmark, easier to automate, easier to ship in CI. But admissibility is where silent approximations, overstated claims, and maturity mismatches accumulate.&lt;/p&gt;

&lt;p&gt;SPAR is one working answer to that problem. Not a universal answer. A technical one.&lt;/p&gt;

&lt;p&gt;It turns implementation state, maturity state, analytical anchoring, and scope honesty into review objects that can travel with the result.&lt;/p&gt;

&lt;p&gt;That is why the architecture may matter outside the domain that produced it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/SPAR-Framework" rel="noopener noreferrer"&gt;github.com/flamehaven01/SPAR-Framework&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/WHAT_IS_SPAR.md" rel="noopener noreferrer"&gt;What Is SPAR&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/ADMISSIBILITY.md" rel="noopener noreferrer"&gt;Admissibility&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/PHYSICS_PROOF_CASE.md" rel="noopener noreferrer"&gt;Physics as the Proof Case&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/USE_CASES.md" rel="noopener noreferrer"&gt;Use Cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>governance</category>
      <category>ai</category>
      <category>verification</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>AI SLOP Detector v3.1: Three Formula Refinements and the Adversarial Tester That Found Them</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:29:57 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n</guid>
      <description>&lt;p&gt;We shipped v2.9.0 with a scoring engine we trusted. We ran tests. Everything passed.&lt;/p&gt;

&lt;p&gt;Then we built a tool specifically designed to find cases where the score was &lt;em&gt;less precise than it could be&lt;/em&gt; — and it found three.&lt;/p&gt;

&lt;p&gt;This is the story of v3.1.0. And the patch that followed six hours later.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmla75gj6xobi0was3209.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmla75gj6xobi0was3209.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Glossary — internal terminology used throughout this post
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deficit score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The final output of the scorer. 0 = structurally clean, 100 = critical. Derived as &lt;code&gt;100 × (1 - GQG)&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GQG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Geometric Quality Gate. A weighted geometric mean of LDR, Inflation quality, DDC, and Purity. The single formula the scorer evaluates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logic Density Ratio. Ratio of executable logic lines to total lines. Low LDR = file is mostly stubs, blanks, or comments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inflation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric that flags jargon-heavy docstrings unsupported by actual code complexity. A 2-line function with a 30-line docstring using 12 buzzwords scores badly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dead/Duplicate Code ratio. Tracks unreachable paths, copy-pasted blocks, phantom imports.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern hit rate. How many structural anti-patterns (god functions, stub returns, nested complexity) fire on the file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cyclomatic Complexity (CC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Count of independent code paths. A straight-line function = CC 1. Each &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;for&lt;/code&gt;, &lt;code&gt;while&lt;/code&gt;, &lt;code&gt;except&lt;/code&gt; adds 1.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;fhval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;flamehaven-validator. An external tool that interrogates the scorer from outside the codebase. Its purpose is to catch cases where internal test consistency masquerades as correctness.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPAR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subcommand of fhval. Adversarial regression loop with three layers. Tests whether the scorer measures what it claims to measure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jensen-Shannon Divergence. A symmetric, bounded (0–1) measure of divergence between two probability distributions. Used here to compare AST node-type histograms between functions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree. The parsed structure of source code. An &lt;code&gt;if&lt;/code&gt; statement, a &lt;code&gt;return&lt;/code&gt;, a function call each become typed nodes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;function_clone_cluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New pattern in v3.1.0. Detects files where many functions share near-identical AST structure — the fragmented god function evasion pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;placeholder_variable_naming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New pattern in v3.1.0. Detects vocabulary-clean code with zero semantic content: single-letter parameter floods, sequential numbered variables.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AM/GM gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core refinement in v3.1.0. The calibrator used an arithmetic mean (simpler approximation); the scorer uses a geometric mean (the precise target formula). Aligning them closes a ~5-7pt estimation gap on uneven files.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Quick context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI SLOP Detector&lt;/a&gt; is a static analyzer that measures structural code quality — not style, not formatting. It scores each file across four dimensions and assigns a &lt;code&gt;deficit&lt;/code&gt; between 0 (clean) and 100 (critical):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ratio of executable logic to total lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inflation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jargon, docstring bloat, unsupported claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unreachable paths, copy-pasted blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern hit rate (stubs, god functions, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These four numbers feed a single formula — a weighted geometric mean — called the &lt;strong&gt;GQG&lt;/strong&gt;. The output is the deficit score: &lt;code&gt;100 × (1 - GQG)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The calibrator's job is to find the best weights for that formula by searching over thousands of known cases.&lt;/p&gt;
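&lt;p&gt;For orientation, the GQG pipeline can be sketched in a few lines. The dimension scores and equal weights below are made up for illustration; the calibrated weights ship with the tool:&lt;/p&gt;

```python
import math

def gqg(scores: dict, weights: dict) -> float:
    # Weighted geometric mean computed in log space, floored at 1e-4
    total_w = sum(weights.values())
    log_sum = sum(w * math.log(max(1e-4, scores[k])) for k, w in weights.items())
    return math.exp(log_sum / total_w)

scores  = {"ldr": 0.9, "inflation_q": 0.6, "ddc": 0.8, "purity": 0.7}
weights = {"ldr": 1.0, "inflation_q": 1.0, "ddc": 1.0, "purity": 1.0}

deficit = min(100.0, 100.0 * (1.0 - gqg(scores, weights)))
print(round(deficit, 1))  # one weak dimension drags the whole product down
```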




&lt;h2&gt;
  
  
  Before v3.1.0: the self-scan
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrv0jxl1j4mnk2q1bnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrv0jxl1j4mnk2q1bnk.png" alt="3critical blind spots discovered" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We don't ship a version without running the detector against itself. Before cutting v3.1.0, we ran v3.0.3 — a structural debt reduction pass on the three highest-deficit files in the codebase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Self-scan before: avg_deficit=23.57, 15 deficit files, status=suspicious
Self-scan after:  avg_deficit=20.33, 12 deficit files, status=clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;analysis/cross_file.py&lt;/code&gt; dropped from 70.3 to 28.7 (critical → clean). &lt;code&gt;ci_gate.py&lt;/code&gt; from 69.3 to 22.3. &lt;code&gt;cli.py&lt;/code&gt; from 68.4 to 20.9. The fixes were mechanical: extracted nested closures to private methods, replaced &lt;code&gt;if/elif/else&lt;/code&gt; dispatch chains with dict dispatch, removed re-declared constants.&lt;/p&gt;
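&lt;p&gt;The dict-dispatch change is the kind of mechanical fix worth showing. This is a generic before-shape/after-shape sketch, not the actual &lt;code&gt;cli.py&lt;/code&gt; code:&lt;/p&gt;

```python
# Handlers are stand-ins for real subcommand implementations
def handle_scan(args):   return "scan"
def handle_report(args): return "report"
def handle_gate(args):   return "gate"

# One flat lookup table replaces an if/elif/else chain,
# so adding a command no longer raises cyclomatic complexity
COMMANDS = {"scan": handle_scan, "report": handle_report, "gate": handle_gate}

def dispatch(command, args=None):
    handler = COMMANDS.get(command)
    if handler is None:
        raise SystemExit(f"unknown command: {command}")
    return handler(args)

print(dispatch("scan"))
```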

&lt;p&gt;The point is not that these numbers are good. It's that the tool had to earn its own PASS before we shipped the version that refines the formula. Shipping a scoring engine while your own codebase sits at &lt;code&gt;suspicious&lt;/code&gt; would have been its own kind of slop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The adversarial tester: fhval SPAR
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://dev.to/flamehaven01/i-built-an-ecosystem-of-46-ai-assisted-repos-then-i-realized-it-might-be-eating-itself-46ni"&gt;previous post&lt;/a&gt; we described &lt;code&gt;fhval&lt;/code&gt; — flamehaven-validator. The core concern: when every tool in an ecosystem is built by the same person against the same baseline, internal consistency can masquerade as correctness. Passing your own tests proves nothing about whether your tests are asking the right questions.&lt;/p&gt;

&lt;p&gt;For v3.1.0 we added a &lt;code&gt;spar&lt;/code&gt; subcommand — an adversarial regression loop that interrogates the scorer from the outside. Running SPAR against the v3.0.x scorer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPAR score: 55 / 100  [FAIL]

Layer A anomalies:
  A3 stub_class_8_methods     expected &amp;gt;= 30  got 20.0  [ANOMALY]
  A4 fragmented_god_function  expected &amp;gt;= 10  got  0.0  [ANOMALY]
  A5 vocab_clean_meaningless  expected &amp;gt;=  8  got  0.0  [ANOMALY]

Layer C blind spots:
  C2 inflation_blindspot      [BLIND_SPOT]
  C3 ddc_annotation_gap       [BLIND_SPOT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three gaps. Two documented scope limits. Score: 55 FAIL.&lt;/p&gt;

&lt;p&gt;Each gap pointed at a specific detection weakness. The SPAR methodology itself — how Layer A/B/C work, why adversarial ground truth is hard to author from inside the codebase — is a separate topic covered in tomorrow's post. Here we focus on what the gaps told us and what we changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refinement 1: The calibrator and scorer were using different formulas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ye6lvz5fzouhj62ksxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ye6lvz5fzouhj62ksxv.png" alt="the geometric mean" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scorer computes a &lt;strong&gt;weighted geometric mean&lt;/strong&gt;. The calibrator — which finds optimal weights — was computing a &lt;strong&gt;weighted arithmetic mean&lt;/strong&gt; as its optimization target.&lt;/p&gt;

&lt;p&gt;Those are not the same thing, and for a quality gate, the difference is structural.&lt;/p&gt;

&lt;p&gt;Consider a file with three dimension scores: LDR=0.9 (good), inflation_quality=0.1 (very bad), DDC=0.8 (good).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Formula&lt;/th&gt;
&lt;th&gt;Calculation&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Deficit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Arithmetic mean&lt;/td&gt;
&lt;td&gt;(0.9 + 0.1 + 0.8) / 3&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geometric mean&lt;/td&gt;
&lt;td&gt;(0.9 × 0.1 × 0.8) ^ (1/3)&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The arithmetic mean gives deficit=40. The geometric mean gives deficit=58. The gap is 18 points — not rounding, but structural. The geometric mean &lt;strong&gt;amplifies weak dimensions&lt;/strong&gt; because one bad score pulls the entire product down. The arithmetic mean averages over them.&lt;/p&gt;

&lt;p&gt;The scorer uses the geometric mean for good reason: a file with excellent LDR but zero actual logic (all docstrings) should not score deficit=30. It should score much higher. The formula enforces that.&lt;/p&gt;

&lt;p&gt;The first-generation calibrator used an arithmetic mean as a simpler starting approximation. So it was finding weights that minimize error against a different objective than the scorer actually computes. The result: roughly 5–7 point underestimation on files with uneven dimension profiles — which are precisely the target of this tool.&lt;/p&gt;

&lt;p&gt;The AM ≥ GM inequality means the calibrator's scores were always optimistic. For balanced files (all dimensions similar) the gap is small and harmless. For uneven files, it was systematic — and those are the cases that matter most.&lt;/p&gt;
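&lt;p&gt;The gap in the table is reproducible in a few lines:&lt;/p&gt;

```python
scores = [0.9, 0.1, 0.8]

am = sum(scores) / len(scores)                       # arithmetic mean
gm = (scores[0] * scores[1] * scores[2]) ** (1 / 3)  # geometric mean

deficit_am = 100 * (1 - am)
deficit_gm = 100 * (1 - gm)
print(round(deficit_am), round(deficit_gm))  # 40 58
```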

&lt;p&gt;&lt;strong&gt;Refinement:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (calibrator _recompute_deficit)
&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_ldr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ldr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_inflation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inflation_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_ddc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ddc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt;

&lt;span class="c1"&gt;# After — mirrors the scorer's GQG formula exactly
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="n"&gt;gqg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;w_ldr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ldr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_inflation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inflation_n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_ddc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ddc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deficit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gqg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why SPAR anomaly A3 (&lt;code&gt;stub_class_8_methods&lt;/code&gt;) jumped from deficit 20.0 to 40.0: the stub class had heavily uneven dimensions, and the geometric mean scored it correctly once the calibrator was trained against the right target.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refinement 2: The complexity modifier had a dead zone at the common end
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyfwsfopk911dg0uec96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyfwsfopk911dg0uec96.png" alt="docstring bloat" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The inflation metric applies a complexity modifier to penalize functions that are simultaneously simple and jargon-heavy — a common pattern in AI-generated code: a two-line function surrounded by an elaborate docstring.&lt;/p&gt;

&lt;p&gt;The first-generation modifier formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;complexity_modifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_complexity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For CC=1: &lt;code&gt;1.0 + (1-3)/10 = 0.8&lt;/code&gt; → &lt;code&gt;max(1.0, 0.8)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;br&gt;
For CC=2: &lt;code&gt;1.0 + (2-3)/10 = 0.9&lt;/code&gt; → &lt;code&gt;max(1.0, 0.9)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;br&gt;
For CC=3: &lt;code&gt;1.0 + (3-3)/10 = 1.0&lt;/code&gt; → &lt;code&gt;max(1.0, 1.0)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CC=1, 2, and 3 all received the same modifier: 1.0. This meant simple functions — the three most common complexity levels — paid no complexity premium on inflation, regardless of how jargon-heavy they were. The modifier only activated from CC=4 upward.&lt;/p&gt;

&lt;p&gt;Simple, jargon-heavy functions are the most common AI code signature. The formula was least sensitive precisely where it needed to be most sensitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — CC=1 is the baseline, not CC=3
&lt;/span&gt;&lt;span class="n"&gt;complexity_modifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_complexity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now CC=2 gets a 1.10× modifier, CC=3 gets 1.20×. The penalty scales from the simplest meaningful function upward.&lt;/p&gt;
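The dead zone and its fix can be put side by side in a minimal sketch, using the before/after formulas above:

```python
def modifier_before(cc: float) -> float:
    # old formula: CC=3 baseline creates a dead zone at CC=1..3
    return max(1.0, 1.0 + (cc - 3.0) / 10.0)

def modifier_after(cc: float) -> float:
    # new formula: CC=1 baseline, so the penalty scales from the simplest function up
    return max(1.0, 1.0 + (cc - 1.0) / 10.0)

for cc in (1, 2, 3, 4):
    print(cc, modifier_before(cc), modifier_after(cc))
```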




&lt;h2&gt;
  
  
  Refinement 3: Purity weight was documented but not connected
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1d144fslfbtceumfn9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1d144fslfbtceumfn9o.png" alt="catching stub piplelines and placeholder variable floods" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The GQG formula includes a purity dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;  &lt;span class="c1"&gt;# hardcoded constant
&lt;/span&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gqg_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;purity_penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.slopconfig.yaml&lt;/code&gt; had a &lt;code&gt;weights.purity&lt;/code&gt; field. The calibrator's weight search had a purity parameter. Neither was connected to this constant — users could configure &lt;code&gt;weights.purity: 0.20&lt;/code&gt; and nothing would change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# default unchanged; now configurable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. The config surface now matches the implementation.&lt;/p&gt;
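For completeness, a hypothetical `.slopconfig.yaml` fragment; only the `weights.purity` key is confirmed above, and the layout of sibling keys is illustrative:

```yaml
# .slopconfig.yaml (fragment; sibling keys omitted)
weights:
  purity: 0.20   # read by the scorer as of this release; defaults to 0.10 if omitted
```

With the fix, `weights.get("purity", 0.10)` picks this value up; before, the hardcoded constant always won.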




&lt;h2&gt;
  
  
  Two new detection patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stub evasion: empty container returns
&lt;/h3&gt;

&lt;p&gt;The existing &lt;code&gt;return_constant_stub&lt;/code&gt; pattern caught &lt;code&gt;return True&lt;/code&gt;, &lt;code&gt;return 0&lt;/code&gt;, &lt;code&gt;return "string"&lt;/code&gt; — but not &lt;code&gt;return {}&lt;/code&gt;, &lt;code&gt;return []&lt;/code&gt;, &lt;code&gt;return ()&lt;/code&gt;, &lt;code&gt;return set()&lt;/code&gt;. These are equally common stub patterns in class skeletons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# was not flagged before
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# was not flagged before
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are now caught by &lt;code&gt;return_constant_stub&lt;/code&gt; and &lt;code&gt;interface_only_class&lt;/code&gt;.&lt;/p&gt;
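A minimal sketch of what such a check reduces to at the AST level. This is an illustration of the idea, not the real `return_constant_stub` implementation:

```python
import ast

def is_empty_container_return(stmt: ast.stmt) -> bool:
    """Heuristic: does this statement return an empty container literal or constructor?"""
    if not isinstance(stmt, ast.Return) or stmt.value is None:
        return False
    v = stmt.value
    if isinstance(v, ast.Dict):
        return not v.keys                                 # return {}
    if isinstance(v, (ast.List, ast.Tuple)):
        return not v.elts                                 # return [] / return ()
    if isinstance(v, ast.Call) and isinstance(v.func, ast.Name):
        # return set() / dict() / list() / tuple() / frozenset()
        return (v.func.id in {"set", "dict", "list", "tuple", "frozenset"}
                and not v.args and not v.keywords)
    return False

fn = ast.parse("def get_results(self) -> dict:\n    return {}\n").body[0]
print(is_empty_container_return(fn.body[0]))   # True
```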




&lt;h3&gt;
  
  
  Fragmented god function: AST clone detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimenh3uabx0lodo1yqs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimenh3uabx0lodo1yqs0.png" alt="AST Jensen-shannon divergence" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPAR anomaly A4 was a file with 12 one-liner helper functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;
&lt;span class="c1"&gt;# ... through r12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function individually looks clean: low complexity, no nesting, short. No single function exceeds any per-function threshold. But collectively, this is a decomposed god function — a large computation split into structurally identical fragments that evade per-function gates.&lt;/p&gt;

&lt;p&gt;The new pattern: &lt;code&gt;function_clone_cluster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; For each file, build a 30-dimensional histogram of AST node types for every function: how many &lt;code&gt;If&lt;/code&gt; nodes, &lt;code&gt;Return&lt;/code&gt; nodes, &lt;code&gt;Call&lt;/code&gt; nodes, &lt;code&gt;BinOp&lt;/code&gt; nodes, and so on. The histogram is normalized to a probability distribution. Then compute the pairwise Jensen-Shannon divergence (JSD) between all function pairs. With base-2 logarithms, JSD is bounded between 0 and 1. Two functions with near-identical AST structure produce a JSD close to 0.&lt;/p&gt;

&lt;p&gt;Functions with JSD &amp;lt; 0.05 get an edge in a graph. BFS finds connected components. The largest component is the clone cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thresholds:
  &amp;gt;= 6 functions in cluster: CRITICAL
  &amp;gt;= 4 functions in cluster: HIGH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why JSD and not simpler metrics.&lt;/strong&gt; Neither cosine similarity nor Euclidean distance on raw histograms handles sparse distributions well — short functions have mostly empty histograms, and small absolute differences dominate. JSD compares distributions rather than raw vectors and stays stable when most histogram dimensions are near zero. It is also bounded above by 1, which makes the 0.05 threshold interpretable rather than dataset-dependent.&lt;/p&gt;
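The histogram-JSD-BFS pipeline described above can be sketched end to end in a few dozen lines. This version counts all node types rather than the fixed 30 dimensions the tool uses, and is an illustration rather than the shipped detector:

```python
import ast
import math
from collections import Counter, deque
from itertools import combinations

def node_histogram(func: ast.FunctionDef) -> dict:
    """Normalized histogram of AST node types inside one function."""
    counts = Counter(type(n).__name__ for n in ast.walk(func))
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def jsd(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence with base-2 logs: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def clone_clusters(source: str, threshold: float = 0.05) -> list:
    """Connected components (via BFS) of functions whose pairwise JSD is below threshold."""
    funcs = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]
    hists = [node_histogram(f) for f in funcs]
    adj = {i: set() for i in range(len(funcs))}
    for i, j in combinations(range(len(funcs)), 2):
        if jsd(hists[i], hists[j]) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            seen.add(n)
            queue.extend(adj[n] - comp)
        if len(comp) > 1:
            clusters.append(sorted(funcs[i].name for i in comp))
    return clusters

src = "\n".join(f"def _compute_r{i}(x): return x * 1.{i}" for i in range(1, 7))
print(clone_clusters(src))   # one cluster containing all six structurally identical functions
```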

&lt;p&gt;The JSD threshold (0.05) was calibrated against the internal test corpus. It will produce false positives on files with many similar utility functions — for example, a large set of &lt;code&gt;_validate_field_X()&lt;/code&gt; validators that are structurally identical by design. Adjust via &lt;code&gt;--config&lt;/code&gt; if needed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Placeholder variable naming (v1.0)
&lt;/h3&gt;

&lt;p&gt;SPAR anomaly A5 was vocabulary-clean code with zero semantic content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
    &lt;span class="n"&gt;r3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
    &lt;span class="c1"&gt;# ... through r12
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No buzzwords. No docstring bloat. Every traditional linter passes this. The new &lt;code&gt;placeholder_variable_naming&lt;/code&gt; pattern applies two checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single-letter parameter density&lt;/strong&gt;: 5 or more single-letter parameters (excluding &lt;code&gt;self&lt;/code&gt;, &lt;code&gt;cls&lt;/code&gt;, &lt;code&gt;_&lt;/code&gt;) → HIGH.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential numbered variables&lt;/strong&gt;: a run of 8 or more → HIGH; 4 or more → MEDIUM.&lt;/li&gt;
&lt;/ol&gt;
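The two checks above can be sketched like this. Thresholds come from the description; the function names, return shape, and run-detection internals are illustrative, not the real pattern code:

```python
import ast
import re

def longest_consecutive(nums: set) -> int:
    """Length of the longest run of consecutive integers in nums."""
    best = 0
    for n in nums:
        if n - 1 not in nums:          # n starts a run
            length = 1
            while n + length in nums:
                length += 1
            best = max(best, length)
    return best

def placeholder_findings(source: str) -> list:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Check 1: single-letter parameter density (self/cls/_ excluded)
        params = [a.arg for a in node.args.args if a.arg not in ("self", "cls", "_")]
        if sum(len(p) == 1 for p in params) >= 5:
            findings.append((node.name, "single_letter_params", "HIGH"))
        # Check 2: runs of sequentially numbered assigned variables (r1, r2, ...)
        assigned = {t.id for t in ast.walk(node)
                    if isinstance(t, ast.Name) and isinstance(t.ctx, ast.Store)}
        runs = {}
        for name in assigned:
            m = re.fullmatch(r"([A-Za-z_]+?)(\d+)", name)
            if m:
                runs.setdefault(m.group(1), set()).add(int(m.group(2)))
        for prefix, nums in runs.items():
            run = longest_consecutive(nums)
            if run >= 8:
                findings.append((node.name, f"numbered_vars_{prefix}", "HIGH"))
            elif run >= 4:
                findings.append((node.name, f"numbered_vars_{prefix}", "MEDIUM"))
    return findings

src = "def aggregate(a, b, c, d, e, f, g):\n    r1 = a + b\n    r2 = r1 * c\n    r3 = r2 - d\n    r4 = r3 + e\n    return r4\n"
print(placeholder_findings(src))
```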

&lt;p&gt;This is v1.0: it detects naming &lt;em&gt;style&lt;/em&gt;, not semantic quality. Known false positive zone: scientific and math libraries legitimately use single-letter conventions (&lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;, &lt;code&gt;mu&lt;/code&gt;, &lt;code&gt;sigma&lt;/code&gt;). Suppress with &lt;code&gt;domain_overrides&lt;/code&gt; in &lt;code&gt;.slopconfig.yaml&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  SPAR result after v3.1.0
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPAR score: 85 / 100  [PASS]

Layer A: 5/5 anchors consistent
Layer B: 4 documented limitations (no regressions)
Layer C: C2 inflation_blindspot [BLIND_SPOT — known scope limit]
         C3 ddc_annotation_gap  [BLIND_SPOT — known scope limit]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;55 → 85 PASS.&lt;/p&gt;

&lt;p&gt;The two remaining blind spots are not gaps to close — they're the documented scope limits of static analysis: a tool that reads AST cannot determine whether arithmetic is semantically meaningful, or whether annotation-heavy imports serve a real runtime purpose. Those require a different class of model. Documenting the ceiling is part of the job.&lt;/p&gt;

&lt;p&gt;The full SPAR methodology — how Layer A/B/C work, why Layer A ground truth is hard to author from inside the codebase, and what "validating the validator" means in practice — is covered in tomorrow's post.&lt;/p&gt;




&lt;h2&gt;
  
  
  v3.1.1: the self-inspection patch
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5ynlx4hh85jso5rqdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5ynlx4hh85jso5rqdq.png" alt="dog food" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v3.1.0 and v3.1.1 shipped on the same day. The clone detection pattern introduced in v3.1.0 had a visibility gap: &lt;code&gt;function_clone_cluster&lt;/code&gt; fired in the Issues section but produced no signal in the Core Metrics table. A community issue caught it within hours.&lt;/p&gt;

&lt;p&gt;But before cutting v3.1.1, we ran the tool against itself — and the new patterns found something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;placeholder.py    deficit: 70.3  [CRITICAL_DEFICIT]
python_advanced.py  deficit: 74.0  [CRITICAL_DEFICIT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both files are part of the detection engine itself. Root cause: &lt;code&gt;check_node&lt;/code&gt; methods with cyclomatic complexity 20–31, caused by compound boolean logic that had accumulated across releases. The tool was flagging its own pattern implementations as having the exact complexity problems it was designed to detect.&lt;/p&gt;

&lt;p&gt;We extracted four module-level helpers in &lt;code&gt;placeholder.py&lt;/code&gt; (&lt;code&gt;_strip_docstring&lt;/code&gt;, &lt;code&gt;_has_abstractmethod&lt;/code&gt;, &lt;code&gt;_empty_container_repr&lt;/code&gt;, &lt;code&gt;_is_placeholder_stmt&lt;/code&gt;) and added &lt;code&gt;_make_god_issue()&lt;/code&gt; and &lt;code&gt;_collect_numbered_vars()&lt;/code&gt; to &lt;code&gt;python_advanced.py&lt;/code&gt;. Each &lt;code&gt;check_node&lt;/code&gt; method went from 20–70 lines to 8–15. The detector earned its own PASS before shipping the patch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;placeholder.py      70.3 → 43.7  [CRITICAL → SUSPICIOUS]
python_advanced.py  74.0 → 66.7  [CRITICAL → INFLATED_SIGNAL]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional v3.1.1 refinements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clone Detection row&lt;/strong&gt; added to Core Metrics table (CRITICAL/PASS at a glance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table style unified&lt;/strong&gt; to &lt;code&gt;box.ROUNDED&lt;/code&gt; across all project output (was mixing three styles).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VS Code extension&lt;/strong&gt;: &lt;code&gt;extractJson()&lt;/code&gt; strips &lt;code&gt;[INFO]&lt;/code&gt; log lines before &lt;code&gt;JSON.parse&lt;/code&gt; — previously caused silent parse failures when CLI log output appeared alongside JSON. Workspace analysis replaced with a QuickPick list of deficit files sorted by score; clicking opens the file in the editor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you installed 3.1.0, upgrade to 3.1.1 before using clone detection in CI.&lt;/p&gt;




&lt;h2&gt;
  
  
  How this fits alongside existing tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl4ag50tf34mrhioj0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl4ag50tf34mrhioj0q.png" alt="compare" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it sees that others don't&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semgrep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern-matching on AST&lt;/td&gt;
&lt;td&gt;Rule violations you've pre-authored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SonarQube&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cognitive complexity, duplication, coverage&lt;/td&gt;
&lt;td&gt;Complexity, coverage gaps — not structural emptiness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Radon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cyclomatic complexity&lt;/td&gt;
&lt;td&gt;Raw CC values; used internally by AI SLOP Detector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bandit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security rules&lt;/td&gt;
&lt;td&gt;Security vulnerabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mutmut / cosmic-ray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mutation testing&lt;/td&gt;
&lt;td&gt;Whether your test suite catches real bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI SLOP Detector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric-based structural analysis&lt;/td&gt;
&lt;td&gt;Docstring theater, stub pipelines, fragmented logic, phantom imports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key gap: a file can be fully SonarQube-clean while containing zero actual logic — all stubs, all docstrings, all type annotations. Cognitive complexity doesn't measure whether the complexity is real. LDR does. Inflation does.&lt;/p&gt;

&lt;p&gt;The complementary tool here is mutation testing. SPAR tests whether the scorer measures what it claims. Mutation testing tests whether your tests catch what they claim to catch. Both are adversarial approaches to the meta-problem: how do you validate the validator?&lt;/p&gt;




&lt;h2&gt;
  
  
  Score evolution
&lt;/h2&gt;

&lt;p&gt;If you're running AI SLOP Detector on an existing project, upgrading to 3.1.x will change your scores. The formula alignment in Refinement 1 increases deficit on files with uneven dimension profiles, typically by 3–8 points. This is not drift — it's the scorer becoming more precise in the region where it matters most. Files that were borderline &lt;code&gt;suspicious&lt;/code&gt; may move into &lt;code&gt;inflated_signal&lt;/code&gt;. Check your CI threshold after upgrading.&lt;/p&gt;

&lt;p&gt;Previous scores were valid estimates produced by the first-generation model. v3.1.x scores are tighter estimates with better sensitivity where dimensions are uneven — which is precisely the profile of AI-generated code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;function_clone_cluster&lt;/code&gt; threshold (JSD &amp;lt; 0.05) was calibrated against the internal test corpus. It will fire false positives on legitimate utility function clusters. Adjust via &lt;code&gt;--config&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;placeholder_variable_naming&lt;/code&gt; v1.0 has no semantic context. &lt;code&gt;def distance(x, y, z)&lt;/code&gt; is legitimate; the pattern doesn't know that.&lt;/li&gt;
&lt;li&gt;SPAR score 85 means five ground truth anchors pass and eight of ten Layer C probes hold. The space of evasion patterns is open-ended. More in tomorrow's SPAR post.&lt;/li&gt;
&lt;li&gt;The Layer A corpus is internally authored. External adversarial contributions would make it stronger.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Install / upgrade
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector&lt;span class="o"&gt;==&lt;/span&gt;3.1.1
&lt;span class="c"&gt;# or&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; ai-slop-detector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VS Code extension: search &lt;strong&gt;"AI SLOP Detector"&lt;/strong&gt; in Extensions, or install from VSIX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code &lt;span class="nt"&gt;--install-extension&lt;/span&gt; vscode-slop-detector-3.1.1.vsix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scan a project&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; ./your-project

&lt;span class="c"&gt;# Machine-readable output&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; ./your-project &lt;span class="nt"&gt;--json&lt;/span&gt; | jq &lt;span class="s1"&gt;'.file_results[] | {file: .file_path, deficit: .deficit_score}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;GitHub: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previous posts in this series:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/the-tool-that-turned-on-itself-ai-slop-detector-v290-v291-3oc4"&gt;v2.9.0/v2.9.1: The Tool That Turned On Itself&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v270-why-we-built-a-linter-we-actually-use-2nb6"&gt;v2.7.0: Why We Built a Linter We Actually Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v263-is-live-on-vs-code-3oj4"&gt;v2.6.3: Now on VS Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/i-built-an-ecosystem-of-46-ai-assisted-repos-then-i-realized-it-might-be-eating-itself-46ni"&gt;fhval: I Built an Ecosystem of 46 AI-Assisted Repos. Then I Realized It Might Be Eating Itself.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>architecture</category>
      <category>devtool</category>
    </item>
    <item>
      <title>My AI Maintainer Kept Making Wrong Calls. So I Made It Report Its State Before Touching Anything.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:24:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7</link>
      <guid>https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7</guid>
      <description>&lt;h2&gt;
  
  
  🔎 Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;memory_injection&lt;/strong&gt;: A MICA operational mode. The archive is updated after each maintenance session and read by the next AI session to compensate for session amnesia.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening output the model must produce at session start — declaring archive version, self-test result, drift status, and active invariants — before any repository-level work begins.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state: file existence, hash integrity, and invocation protocol presence.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Drift Response Policy&lt;/strong&gt;: The schema-level declaration of how hash mismatches and missing files are handled. Different failure classes carry different response actions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant&lt;/strong&gt;: A structured governance rule with identity, severity, and statement. Not a style guideline. A constraint the model cannot rationalize past.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Provenance Registry&lt;/strong&gt;: The record of tracked files with SHA256 hashes. The basis for drift detection.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Deviation Log&lt;/strong&gt;: The audit trail of formal exceptions to design invariants. Empty means no exceptions have been logged — not that no judgment calls were made.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d3fkvj8ciwgp3p6u1eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d3fkvj8ciwgp3p6u1eu.png" alt="coverimage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What Part 5 Left Open
&lt;/h2&gt;

&lt;p&gt;Part 5 placed MICA inside the context engineering landscape and drew one boundary: MICA is not a collection system. It begins after collection ends. Its job is to govern what enters the session, what remains authoritative, and how the model proves it actually loaded the governed archive at all.&lt;/p&gt;

&lt;p&gt;That answer was correct. It was also still abstract.&lt;/p&gt;

&lt;p&gt;This post comes down from that framing. It shows what MICA looks like when it is actually running — not as a concept, but as a protocol inside a real project.&lt;/p&gt;

&lt;p&gt;The project is &lt;code&gt;flamehaven.space&lt;/code&gt;, a Next.js B2B site maintained by a solo operator using a MICA package in &lt;code&gt;memory_injection&lt;/code&gt; mode. Everything shown here is drawn from the live archive. Values that would expose internal configuration are anonymized; structure and behavior are real.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Session Opening Report
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekhdvby8iwqn2iby9j2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekhdvby8iwqn2iby9j2u.png" alt="The paradigm shift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every MICA session in &lt;code&gt;memory_injection&lt;/code&gt; mode begins with a declared output before any work starts. The archive specifies the required format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: flamehaven-space-maintainer v0.2.0
Self-test: PASS (3 checks -- ST-001, ST-002, ST-003)
Drift: no hash mismatch detected
Active invariants: DI-001 (critical), DI-002 (critical) + 24 others loaded
Gate: PASS -- proceeding with maintenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a courtesy summary. It is a gate. The archive field &lt;code&gt;session_report_format.gate_block_on&lt;/code&gt; is set to &lt;code&gt;critical_self_test_failure&lt;/code&gt; — meaning the model must declare its load state before it is permitted to make any repository-level changes.&lt;/p&gt;

&lt;p&gt;The format is specified in the archive itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json
"session_report_format": {
  "trigger": "session_start",
  "required_fields": ["archive_version", "self_test", "drift_status", "active_invariants", "gate"],
  "format_template": "[SESSION READY]\nArchive: {archive_version}\nSelf-test: {self_test}\nDrift: {drift_status}\nActive invariants: {active_invariants}\nGate: {gate}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model does not decide what to declare. The archive tells it what a valid session opening looks like.&lt;/p&gt;
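Mechanically, the opening report is just that template rendered with the five required fields. A minimal sketch, with field values taken from the sample report above:

```python
# The format_template string, as declared in the archive's session_report_format.
template = ("[SESSION READY]\nArchive: {archive_version}\nSelf-test: {self_test}\n"
            "Drift: {drift_status}\nActive invariants: {active_invariants}\nGate: {gate}")

report = template.format(
    archive_version="flamehaven-space-maintainer v0.2.0",
    self_test="PASS (3 checks -- ST-001, ST-002, ST-003)",
    drift_status="no hash mismatch detected",
    active_invariants="DI-001 (critical), DI-002 (critical) + 24 others loaded",
    gate="PASS -- proceeding with maintenance",
)
print(report)
```

Because the template lives in the archive rather than in the prompt, a missing field is a format violation the operator can spot immediately.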




&lt;h2&gt;
  
  
  3. What the Self-Test Actually Checks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4lxbrkjs3mhf8o3vbrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4lxbrkjs3mhf8o3vbrh.png" alt="self-test mechanics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;self_test_policy&lt;/code&gt; runs on &lt;code&gt;session_start&lt;/code&gt; and &lt;code&gt;pre_handoff&lt;/code&gt;. Three checks matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ST-001&lt;/strong&gt; (&lt;code&gt;provenance_sha256_format&lt;/code&gt;, severity: &lt;code&gt;error&lt;/code&gt;) — verifies that provenance hashes in the registry match the expected format. A malformed hash means the file fingerprint is untrustworthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ST-002&lt;/strong&gt; (&lt;code&gt;provenance_file_exists&lt;/code&gt;, severity: &lt;code&gt;warning&lt;/code&gt;) — verifies that files listed in the provenance registry actually exist on disk. A missing file is not a formatting error; it is a ghost reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ST-003&lt;/strong&gt; (&lt;code&gt;invocation_pattern_present&lt;/code&gt;, severity: &lt;code&gt;error&lt;/code&gt;) — verifies that the invocation protocol is declared and readable. If the model cannot confirm how it was loaded, it cannot confirm the session is in a governed state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failure behavior is set per-check. The overall &lt;code&gt;on_failure&lt;/code&gt; policy for this archive is &lt;code&gt;warn_continue&lt;/code&gt; — the session proceeds, but the failure is surfaced explicitly in the opening report.&lt;/p&gt;

&lt;p&gt;This is a deliberate calibration. A site maintenance session that blocks hard on every provenance warning would be too brittle for solo operation. The severity ladder reflects the actual cost of each failure mode, not a theoretical maximum.&lt;/p&gt;
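&lt;p&gt;A minimal sketch of the three checks, assuming a registry that maps file paths to recorded hashes. The check IDs, severities, and the &lt;code&gt;warn_continue&lt;/code&gt; policy are from the archive; the data layout and helper are illustrative:&lt;/p&gt;

```python
# Illustrative self-test: ST-001 validates hash format, ST-002 validates
# file existence, ST-003 validates the invocation protocol. Under the
# warn_continue policy, failures are surfaced but never abort the session.
import os
import re

SHA256_RE = re.compile(r"^[0-9a-f]{64}$")

def run_self_test(registry, invocation_pattern):
    """registry maps file path to its recorded sha256 hash."""
    failures = []
    for path, digest in registry.items():
        if not SHA256_RE.match(digest):                 # ST-001 (error)
            failures.append(("ST-001", "error", path))
        if not os.path.exists(path):                    # ST-002 (warning)
            failures.append(("ST-002", "warning", path))
    if not invocation_pattern:                          # ST-003 (error)
        failures.append(("ST-003", "error", "invocation protocol missing"))
    status = "PASS" if not failures else "PASS with warnings"
    return status, failures

# A well-formed hash pointing at a file that does not exist: ST-002 fires
# as a warning and the session continues.
status, failures = run_self_test(
    {"ghost.md": "ab" * 32},
    invocation_pattern="mica-load v0.2.0",
)
print(status, failures)
```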




&lt;h2&gt;
  
  
  4. What Drift Detection Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye82pmwyaq0ek7d7qp2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye82pmwyaq0ek7d7qp2k.png" alt="drift response policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive's &lt;code&gt;drift_response_policy&lt;/code&gt; is minimal by design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"drift_response_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_hash_mismatch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warn_continue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_file_missing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warn_block"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reminder_after_change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inline_sync_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction between &lt;code&gt;warn_continue&lt;/code&gt; and &lt;code&gt;warn_block&lt;/code&gt; is operationally significant.&lt;/p&gt;

&lt;p&gt;A hash mismatch means a tracked file changed — which happens legitimately during ordinary maintenance. The model surfaces the mismatch and continues.&lt;/p&gt;

&lt;p&gt;A file that has gone missing entirely is a different failure class. The model blocks and requires operator acknowledgment before proceeding.&lt;/p&gt;
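&lt;p&gt;The dispatch between the two failure classes is small enough to sketch directly. The policy values are quoted from the archive; the dispatcher and its message strings are assumptions:&lt;/p&gt;

```python
# Sketch of the drift_response_policy dispatch: hash mismatches warn and
# continue, missing files block until the operator acknowledges them.
DRIFT_POLICY = {
    "on_hash_mismatch": "warn_continue",
    "on_file_missing": "warn_block",
}

def respond_to_drift(event, path):
    action = DRIFT_POLICY[event]
    if action == "warn_block":
        # different failure class: require operator acknowledgment
        return f"BLOCK: {path} missing, operator acknowledgment required"
    return f"WARN: {path} changed, continuing"

print(respond_to_drift("on_hash_mismatch", "next.config.ts"))
print(respond_to_drift("on_file_missing", "playbook.md"))
```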

&lt;p&gt;&lt;code&gt;reminder_after_change: true&lt;/code&gt; means the archive instructs the model to remind the operator to refresh the provenance registry and artifact manifest before minting the next archive version. This is not automated enforcement. It is memory injection: the archive tells the next session what the previous session should have reminded the operator to do.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deviation_log&lt;/code&gt; in v0.2.0 is empty. That is not a sign the system has never been used. It means no deviations have been formally logged yet — which is itself a state the archive captures.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What Happens When a Deployment Changes Something
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focdk66if1kifuqc9znha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focdk66if1kifuqc9znha.png" alt="deployment evolution" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concrete scenario: the operator ships a writing refresh mode change — switching from automatic ISR to manual operator-triggered revalidation. Three files change: &lt;code&gt;next.config.ts&lt;/code&gt;, a helper &lt;code&gt;.bat&lt;/code&gt; script, and the playbook.&lt;/p&gt;

&lt;p&gt;On the next session open:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Self-test runs. ST-002 may flag if the helper script path is not in the provenance registry yet.&lt;/li&gt;
&lt;li&gt;Drift check runs. Hash mismatches fire for the changed files. &lt;code&gt;on_hash_mismatch: warn_continue&lt;/code&gt; — the session proceeds.&lt;/li&gt;
&lt;li&gt;The opening report surfaces the mismatch:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: flamehaven-space-maintainer v0.2.0
Self-test: PASS with warnings (ST-002: update-writing-now.bat not in provenance registry)
Drift: hash mismatch on next.config.ts, flamehaven-space-maintainer-playbook.v0.2.0.md
Active invariants: DI-001 (critical), DI-002 (critical) + 24 others
Gate: PASS WITH WARNINGS -- operator review required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator now has a concrete decision surface before touching anything: what changed, what the model knows about, and what it does not.&lt;/p&gt;

&lt;p&gt;The model then follows the change process defined in the playbook: identify the canonical subsystem touched, patch the smallest coherent surface, run build and audit, verify route-level behavior, then update README, MICA, or playbook if the change alters maintainer truth.&lt;/p&gt;

&lt;p&gt;At the end of the session, if a new archive version is minted, the synchronization rule is explicit: file name, &lt;code&gt;project.version&lt;/code&gt;, and the archive handoff marker must be updated in the same change. Not sequentially. The same change.&lt;/p&gt;
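&lt;p&gt;The synchronization rule is mechanically checkable. A hedged sketch, assuming the version appears in the archive file name as &lt;code&gt;vX.Y.Z&lt;/code&gt; and that the handoff marker quotes it (both naming conventions are assumptions for illustration):&lt;/p&gt;

```python
# Sketch: verify that archive file name, project.version, and the handoff
# marker all agree before a new archive version is considered minted.
import re

def versions_synchronized(archive_filename, project_version, handoff_marker):
    m = re.search(r"v(\d+\.\d+\.\d+)", archive_filename)
    if m is None:
        return False
    file_version = m.group(1)
    return file_version == project_version and file_version in handoff_marker

# All three surfaces updated in the same change: consistent.
assert versions_synchronized(
    "flamehaven-space-maintainer.v0.3.0.json", "0.3.0", "HANDOFF v0.3.0")
# project.version lagging behind the file name: inconsistent.
assert not versions_synchronized(
    "flamehaven-space-maintainer.v0.3.0.json", "0.2.0", "HANDOFF v0.3.0")
```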

&lt;p&gt;This is the operational point of drift reporting: not merely to announce that something changed, but to force the model and the operator to see the same changed surface before any new work proceeds.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. What the Design Invariants Actually Govern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbq1ujtnozummxbd8pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbq1ujtnozummxbd8pl.png" alt="design invariants" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive carries 26 design invariants. The first six establish the perimeter everything else operates within:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DI-001&lt;/strong&gt; (critical): Flamehaven is positioned as a governance-first, founder-led B2B AI systems practice, not a generic agency or AI wrapper shop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-002&lt;/strong&gt; (critical): Primary conversion surface is the main domain &lt;code&gt;flamehaven.space&lt;/code&gt;, not Medium, Substack, DEV.to, or LinkedIn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-003&lt;/strong&gt; (high): Writing detail pages are authoritative artifacts linked to projects, selected work, contact, and SEO canonical ownership.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-004&lt;/strong&gt; (high): Selected Work must distinguish public, private, and internal systems without broken public links or placeholder copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-005&lt;/strong&gt; (high): Legacy WordPress-era routes must redirect away from obsolete agency messaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-006&lt;/strong&gt; (high): Operational choices favor deterministic behavior, inspectability, and maintenance continuity over decorative complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not style guidelines. They are session-blocking constraints. An AI that proposes converting Selected Work to a live-fetch real-time surface is violating DI-006. An AI that treats cross-posting as the canonical publishing path is violating DI-002. An AI that leaves a legacy route live because a redirect “seems unnecessary” is violating DI-005.&lt;/p&gt;
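&lt;p&gt;The gating these examples imply can be sketched in a few lines. The invariant IDs and severities are from the list above; the block/warn logic is an assumption about how a session-blocking constraint would be enforced:&lt;/p&gt;

```python
# Minimal sketch of severity-gated invariant enforcement: violating any
# critical invariant blocks the session; high-severity violations warn.
INVARIANTS = {
    "DI-001": "critical",
    "DI-002": "critical",
    "DI-005": "high",
    "DI-006": "high",
}

def gate_proposal(violated_ids):
    severities = [INVARIANTS[i] for i in violated_ids]
    if "critical" in severities:
        return "BLOCK"
    if severities:
        return "WARN"
    return "PASS"

print(gate_proposal(["DI-002"]))  # cross-posting as canonical path: BLOCK
print(gate_proposal(["DI-006"]))  # decorative complexity: WARN
```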

&lt;p&gt;The invariants exist so the model cannot rationalize its way past the operator's architectural intent — even across sessions, even with a new model instance that has never seen the project before.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. What MICA Cannot Do Here
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tpmb7ohzye0el0dj8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tpmb7ohzye0el0dj8i.png" alt="system limits and human authority" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deviation_log&lt;/code&gt; is empty because no formal deviation has been logged. But there have been judgment calls.&lt;/p&gt;

&lt;p&gt;One example is the writing hero image fallback logic. The site had to support both modern Notion block structures and legacy imported posts with a different nesting format. That decision did not begin as an invariant. It began as a session-level judgment call, became a playbook rule, and only then became stable maintainer truth.&lt;/p&gt;

&lt;p&gt;That path matters.&lt;/p&gt;

&lt;p&gt;MICA does not automate the step from “we discussed this and made a call” to “this is now a governed constraint.” It provides the place to put the result. The operator decides what rises to the level of an invariant, what remains a lesson in the playbook, and what disappears when the session ends.&lt;/p&gt;

&lt;p&gt;That boundary — what gets governed, what gets remembered, what gets lost — is not a gap in MICA. It is a design decision made with every archive update.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What Part 7 Will Address
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhw3bmshi00jb3efpmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhw3bmshi00jb3efpmd.png" alt="preview part7" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 6 showed what MICA looks like in operation inside a single maintenance agent. The structure holds. The protocol runs. The session report is predictable.&lt;/p&gt;

&lt;p&gt;But that is still the easier case.&lt;/p&gt;

&lt;p&gt;The project being governed was a site: a relatively stable artifact, maintained by one operator, where the main problem was making sure the model did not forget what already mattered.&lt;/p&gt;

&lt;p&gt;Part 7 moves into a harder setting.&lt;/p&gt;

&lt;p&gt;The governed project is now a tool that runs inside AI workflows itself. That changes the governance problem. The issue is no longer only session amnesia. It is iterative accumulation: what the system learns across cycles, what becomes authoritative, what remains provisional, and what must be carried forward without allowing drift to harden into false memory.&lt;/p&gt;

&lt;p&gt;Part 7 is that case.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; A session report is not a polite summary. It is a hard gate. The model must declare — in the exact format dictated by the archive — what it loaded, what tests passed, and what drift it detected, before it is allowed to touch the repository.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.2.0 standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>contextengineering</category>
      <category>governance</category>
    </item>
    <item>
      <title>How Auditing 10 Bio-AI Repositories Shaped STEM-AI</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 30 Mar 2026 12:07:20 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5</link>
      <guid>https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbn484a53idqb1b8rm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbn484a53idqb1b8rm8.png" alt="Cover image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reading path:&lt;/strong&gt;&lt;br&gt;
This post is part of a series.&lt;br&gt;
(1) &lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;STEM-AI introduction&lt;/a&gt; — what the framework is and why we built it&lt;br&gt;
(2) &lt;a href="https://flamehaven.space/writing/bio-ai-repository-audit-2026-a-technical-report-on-10-open-source-systems/" rel="noopener noreferrer"&gt;Technical audit report&lt;/a&gt; — full findings across 10 repositories&lt;br&gt;
(3) &lt;a href="https://flamehaven.space/writing/i-audited-10-open-source-bio-ai-repos-most-could-produce-outputs-few-could-establish-trust/" rel="noopener noreferrer"&gt;Narrative summary&lt;/a&gt; — what those findings actually mean&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Text Could See — and What Code Revealed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favblcoe473uege82hbwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favblcoe473uege82hbwl.png" alt="what text could see" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In March 2026, we ran STEM-AI against 10 high-visibility open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The framework did what it was designed to do. It surfaced missing disclaimers, absent CI, weak reproducibility signals, and public-facing governance gaps. Those findings mattered, and the scores were directionally right.&lt;/p&gt;

&lt;p&gt;But when we reviewed the audits more carefully, one pattern kept appearing: some of the most consequential failures were not visible in the artifact surface at all. They only became obvious when we looked directly at the code.&lt;/p&gt;

&lt;p&gt;This function lives inside a repository presented as an AI-driven drug discovery workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_analogues(self, seed_smiles: str, count: int = 3):
    """
    Mocks a generative model (like REINVENT).
    In a real app, this would call a PyTorch model.
    """
    # Simple string manipulation for demo purposes
    analogues = []
    for i in range(count):
        if "C" in seed_smiles:
            new_smi = seed_smiles.replace("C", "C(C)", 1) if i == 0 else seed_smiles + "F"
            analogues.append(new_smi)
        else:
            analogues.append(seed_smiles + "C")
    return analogues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does not generate molecules. It appends characters.&lt;/p&gt;

&lt;p&gt;SMILES (Simplified Molecular Input Line Entry System) is a strict notation for molecular structure. A valid SMILES string encodes real atoms, bonds, and connectivity. Appending &lt;code&gt;C&lt;/code&gt; or &lt;code&gt;F&lt;/code&gt; yields strings that may still parse as SMILES, but they are arbitrary string edits, not generated analogues of the seed. The function runs without error, returns a list, and the pipeline continues downstream.&lt;/p&gt;
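&lt;p&gt;Calling the mock directly makes the point concrete. This is the function from the repository, condensed; the seed value is an arbitrary example:&lt;/p&gt;

```python
# Running the mock shows it is string surgery, not generation: the second
# and third "analogues" are identical, and none come from a model.
def generate_analogues(seed_smiles, count=3):
    analogues = []
    for i in range(count):
        if "C" in seed_smiles:
            new_smi = (seed_smiles.replace("C", "C(C)", 1) if i == 0
                       else seed_smiles + "F")
            analogues.append(new_smi)
        else:
            analogues.append(seed_smiles + "C")
    return analogues

out = generate_analogues("CCO")   # ethanol as the seed
print(out)                        # ['C(C)CO', 'CCOF', 'CCOF']
```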

&lt;p&gt;Our framework scored this repository T0. Correctly. But not because it saw this function.&lt;/p&gt;

&lt;p&gt;It scored T0 because the README was missing disclaimers. The CI was absent. Reproducibility was undocumented. Text-path evaluation is designed to measure exactly that. It did.&lt;/p&gt;

&lt;p&gt;The audit result was correct. The evidence surface had room to go deeper.&lt;/p&gt;

&lt;p&gt;Running the audits showed us what code-path evaluation could add on top.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Code Access Makes Visible&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhefzdhoiz8nakmrf3959.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhefzdhoiz8nakmrf3959.png" alt="What Code Access Makes Visible" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The drug discovery example was not unusual.&lt;/p&gt;

&lt;p&gt;CellAgent's pipeline ends with this call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py, line 153
&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The method exists. Its body is &lt;code&gt;pass&lt;/code&gt;. The pipeline completes without error and produces nothing. A text audit reading the README would have no way to know this.&lt;/p&gt;
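&lt;p&gt;This class of failure is statically detectable. A sketch using Python's standard &lt;code&gt;ast&lt;/code&gt; module: flag any function whose entire body is &lt;code&gt;pass&lt;/code&gt;, optionally after a docstring. The sample source below paraphrases the CellAgent shape; it is not the actual file:&lt;/p&gt;

```python
# Sketch: find stub functions whose body is only `pass` (after an optional
# docstring). Such a method completes without error and produces nothing.
import ast

def find_stub_functions(source):
    stubs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            body = node.body
            # skip a leading docstring if present
            if body and isinstance(body[0], ast.Expr) and isinstance(
                    body[0].value, ast.Constant):
                body = body[1:]
            if all(isinstance(stmt, ast.Pass) for stmt in body):
                stubs.append(node.name)
    return stubs

SAMPLE = '''
class Planner:
    def generate_final_result(self, results):
        pass
'''
print(find_stub_functions(SAMPLE))   # ['generate_final_result']
```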

&lt;p&gt;BioAgents includes a rate limiter for external API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// rateLimiter.ts, lines 62-68&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;USE_JOB_QUEUE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rateLimiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Job queue disabled - skip rate limiting for direct calls&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;USE_JOB_QUEUE&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt; in &lt;code&gt;.env.example&lt;/code&gt;. Every default deployment has rate limiting disabled. The function name implies protection. In default operation, there is none.&lt;/p&gt;
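&lt;p&gt;The fail-open default is reproducible in two lines. The variable name is from the repository; the surrounding helper is a sketch:&lt;/p&gt;

```python
# Unless the operator explicitly opts in, the guard never runs:
# .env.example ships USE_JOB_QUEUE=false, so the default path skips it.
def rate_limiting_active(env):
    return env.get("USE_JOB_QUEUE", "false") == "true"

assert not rate_limiting_active({})                      # default deployment
assert rate_limiting_active({"USE_JOB_QUEUE": "true"})   # explicit opt-in
```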

&lt;p&gt;The pattern across all three:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The code looks governed.&lt;/li&gt;
&lt;li&gt;The behavior tells a different story.&lt;/li&gt;
&lt;li&gt;That story is only visible when you read the code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Text scores and code behavior can diverge. Knowing where and how they diverge is the next layer of evidence worth capturing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Directions the Audits Opened
&lt;/h2&gt;

&lt;p&gt;Reviewing all ten audits, we identified four areas where code-path evaluation could extend what text auditing already does well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direction 1: Clinical exposure is visible in imports, not just in README text.
&lt;/h3&gt;

&lt;p&gt;A repository importing pharmacogenomics allele tables has clinical exposure regardless of what its README says. Detecting that dependency at the import level — rather than waiting for a disclaimer — lets the framework flag exposure earlier. &lt;/p&gt;

&lt;p&gt;The key distinction is severity: a direct pharmacogenomics import (&lt;code&gt;CYP2D6&lt;/code&gt;, &lt;code&gt;CPIC&lt;/code&gt;) signals live patient-facing risk and is classified CA-DIRECT. &lt;/p&gt;

&lt;p&gt;A general-purpose medical imaging library like &lt;code&gt;pydicom&lt;/code&gt; or MONAI is classified CA-INDIRECT — research-use exposure, not necessarily a live clinical output path. The import alone does not determine clinical risk; the classification tier does.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 2: Not all clinical proximity is the same.
&lt;/h3&gt;

&lt;p&gt;A live pharmacogenomics dosage tool and a README roadmap note about a future ClinVar integration are not equivalent risks. Differentiating them — live output vs. research context vs. planned feature — makes the evaluation more precise and makes the accountability expectations more appropriate.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 3: Scoring stability is worth measuring directly.
&lt;/h3&gt;

&lt;p&gt;We ran Stage 1 on one repository in multiple passes. The results ranged across 28 points on the same input. Overlapping trigger conditions between hype-detection items are one contributing factor. &lt;/p&gt;

&lt;p&gt;LLM runtime stochasticity is another — the exact split between the two is still under measurement. Adding explicit discrimination examples — what exact phrasing triggers each item, what does not — makes the scoring surface cleaner and reduces the most obvious sources of variance.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 4: Code-path behavior deserves its own scan layer.
&lt;/h3&gt;

&lt;p&gt;A fail-open pattern is a control path that appears to enforce a constraint but defaults to bypassing it. The BioAgents rate limiter above is the example. In a clinical output path, a silent pass-through is not graceful degradation. It is an untraced result that looks like a real one. Building a dedicated scan for these patterns adds a check that text auditing was never meant to provide.&lt;/p&gt;




&lt;p&gt;These four directions came directly from running the audits. The scores across the 10 repositories remain as published. Code-path evaluation is what the framework can now add on top of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What v1.0.6 Added — Carried Forward in v1.1.2&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These changes were introduced in v1.0.6 and are carried forward in the current internal v1.1.2 package. They extend the framework's evidence surface into code-level behavior. Calibration is ongoing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Two Evidence Paths, Not One
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkol0yxx1ht5r0svy5yk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkol0yxx1ht5r0svy5yk.png" alt="Two Evidence Paths, Not One" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We narrowed one of the biggest divergence points by splitting evaluation into a text path and a code path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;text path&lt;/strong&gt; works as before: read the README, CHANGELOG, and public posts, score against the rubric. Always available regardless of access to the repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;code path&lt;/strong&gt; activates when the audit has a local clone. It runs through Claude Code, Codex CLI, Gemini CLI, or Copilot CLI. Claims are not interpreted. They are measured. A README that says "IRB-approved data" earns no points for the statement. Points require a provenance artifact in the code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When code confirms the README, that is a positive signal. When it contradicts it, that contradiction is the finding.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Clinical Dependency Detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbakbfbq74n5vp5d2rig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbakbfbq74n5vp5d2rig.png" alt="Clinical Dependency Detection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the start of every local audit, a scan script reads Python imports and README keywords. It classifies the result into one of three severity levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ca_detection_scan.sh -- pharmacogenomics section&lt;/span&gt;

&lt;span class="c"&gt;# CA-DIRECT: live patient-facing output risk&lt;/span&gt;
check_import &lt;span class="s2"&gt;"CPIC|cpic|PharmGx|pharmacogenomic"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Pharmacogenomics (CPIC/PharmGx)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-DIRECT"&lt;/span&gt;

check_import &lt;span class="s2"&gt;"DPYD|CYP2D6|CYP2C19|allele"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Pharmacogene alleles"&lt;/span&gt; &lt;span class="s2"&gt;"CA-DIRECT"&lt;/span&gt;

&lt;span class="c"&gt;# CA-INDIRECT: research-use clinical exposure&lt;/span&gt;
check_import &lt;span class="s2"&gt;"import pydicom|from pydicom"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"pydicom (DICOM imaging)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;

check_import &lt;span class="s2"&gt;"import monai|from monai"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"MONAI (medical AI)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accountability requirements follow the actual clinical proximity of the code. Not the aspirational proximity of the roadmap. A roadmap mention without active implementation is treated as CA-PLANNED rather than collapsed into the same bucket as live clinical output. &lt;/p&gt;

&lt;p&gt;The pattern matching is against import statements and function names, not comment text. False positive calibration is still in progress.&lt;/p&gt;
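&lt;p&gt;A Python rendering of the same classification, with the severity precedence made explicit: a direct pharmacogenomics hit outranks an indirect imaging hit. The patterns are condensed from the shell script above; the function shape is illustrative:&lt;/p&gt;

```python
# Sketch: classify a source file's clinical exposure from import-level
# patterns. CA-DIRECT takes precedence over CA-INDIRECT when both match.
import re

CA_PATTERNS = [
    ("CA-DIRECT", re.compile(r"CPIC|cpic|PharmGx|pharmacogenomic")),
    ("CA-DIRECT", re.compile(r"DPYD|CYP2D6|CYP2C19|allele")),
    ("CA-INDIRECT", re.compile(r"import pydicom|from pydicom")),
    ("CA-INDIRECT", re.compile(r"import monai|from monai")),
]

def classify_clinical_exposure(source_text):
    hits = {level for level, pattern in CA_PATTERNS
            if pattern.search(source_text)}
    if "CA-DIRECT" in hits:
        return "CA-DIRECT"
    if "CA-INDIRECT" in hits:
        return "CA-INDIRECT"
    return "CA-NONE"

print(classify_clinical_exposure("from pydicom import dcmread"))  # CA-INDIRECT
print(classify_clinical_exposure("table = load_cpic_alleles()"))  # CA-DIRECT
```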




&lt;h3&gt;
  
  
  3. Code Integrity Scanning (C1-C4)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69g7synh94iizuciqpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69g7synh94iizuciqpy.png" alt="Code Integrity Scanning (C1-C4)" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A second scan handles four code-level checks: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hardcoded credentials (C1)&lt;/li&gt;
&lt;li&gt;unpinned dependencies (C2)&lt;/li&gt;
&lt;li&gt;clinical-path stubs (C3)&lt;/li&gt;
&lt;li&gt;fail-open exception handlers (C4).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The C4 check targets the BioAgents-style pattern. Searching for clinical keywords on the &lt;code&gt;except:&lt;/code&gt; line misses most real cases. Clinical context lives in function names and surrounding code. The scan uses a two-pass approach: first identify files with clinical-domain context, then find silent exception handlers within those files.&lt;/p&gt;

&lt;p&gt;A silent &lt;code&gt;except: pass&lt;/code&gt; in a clinical-context file is a trust-surface failure. The scan makes it visible without requiring a reviewer to read every exception block manually.&lt;/p&gt;
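&lt;p&gt;The two-pass structure can be sketched as follows. The keyword list and the handler pattern are illustrative assumptions, not the production scan:&lt;/p&gt;

```python
# Sketch of the two-pass C4 scan: pass one flags files with clinical-domain
# context, pass two finds silent exception handlers inside only those files.
import re

CLINICAL_CONTEXT = re.compile(r"dos(e|age)|allele|diagnos|patient", re.I)
SILENT_HANDLER = re.compile(r"except\s*(\w+\s*)?:\s*\n\s*pass\b")

def scan_fail_open(files):
    """files maps path to source text; returns paths with silent handlers."""
    clinical = [p for p, src in files.items() if CLINICAL_CONTEXT.search(src)]
    return [p for p in clinical if SILENT_HANDLER.search(files[p])]

SRC = (
    "def adjust_dosage(x):\n"
    "    try:\n"
    "        check(x)\n"
    "    except ValueError:\n"
    "        pass\n"
)
print(scan_fail_open({"dose.py": SRC, "utils.py": "x = 1\n"}))  # ['dose.py']
```

&lt;p&gt;Restricting pass two to clinically contextual files is what keeps the check from drowning in the ordinary silent handlers that any large codebase contains.&lt;/p&gt;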




&lt;h3&gt;
  
  
  4. Discrimination Examples
&lt;/h3&gt;

&lt;p&gt;To reduce the 28-point variance, we added explicit examples for each hype-detection item: what exact phrasing triggers it, what does not, and what the documented edge cases are.&lt;/p&gt;

&lt;p&gt;The goal is to reduce obvious scoring drift enough that the same repository is no longer interpreted as two different trust surfaces across different auditors or different LLMs. That goal is not yet verified. The discrimination examples are the primary mechanism toward it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Three Questions Now Have a Fourth&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the first post, we described three questions the framework was built around:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Did the repository describe its limits honestly?&lt;/li&gt;
&lt;li&gt;Did public communication remain consistent with those limits?&lt;/li&gt;
&lt;li&gt;Did the codebase show evidence of maintenance and biological responsibility?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Running 10 real audits pointed toward a fourth:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;4. Does the code actually do what the documentation says — and where it diverges, is that divergence visible and traceable?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That fourth question is what the audit outputs kept surfacing. A function name that sounds real. A pipeline that looks complete. An output that is plausible. An implementation that is a stub, or a control path that silently bypasses its own constraint.&lt;/p&gt;

&lt;p&gt;The first three questions can be answered by reading. The fourth requires looking at the code.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What the Framework Added — and What Stays the Same&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first post ended with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"STEM-AI is meant to support serious review, not replace it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That has not changed. Every report carries a non-removable disclaimer: LLM-generated audit, not a regulatory determination, not clinical certification. Every report carries an expiry date. The minimum threshold for supervised pilot consideration is still T3. None of the March 2026 repositories reached it.&lt;/p&gt;

&lt;p&gt;What the audits added is narrower in scope: broader evidence coverage on top of an already-working foundation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yoxv47netm2tmf881xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yoxv47netm2tmf881xg.png" alt="What the Framework Added — and What Stays the Same" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A verifiable artifact shifts the accountability surface — it does not eliminate the possibility of falsification. The framework treats its presence as a necessary condition, not a sufficient one.&lt;/p&gt;

&lt;p&gt;What the framework gained is that the evidence it counts now extends beyond what authors say about their own code.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Three Directions Still Ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvbbl6bgo1z79mrg7wgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvbbl6bgo1z79mrg7wgr.png" alt="Three Directions Still Ahead" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated re-audit on repository changes.&lt;/strong&gt; A score from three months ago may not describe the same repository. The trajectory signal measures issue close rate and release frequency across consecutive 90-day windows. It is a partial answer. A CI-triggered re-audit path is the logical next step.&lt;/p&gt;
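&lt;p&gt;The windowing behind the trajectory signal can be sketched as a single ratio. This simplified version counts all activity events together, whereas the framework tracks issue close rate and release frequency separately:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def trajectory(events, now, window_days=90):
    """Activity ratio across two consecutive 90-day windows.

    events: datetimes of issue closes, releases, and similar signals.
    Returns a ratio: 1.0 is steady, above 1 rising, below 1 decaying.
    Simplified sketch; the framework's real signal separates issue
    close rate from release frequency.
    """
    recent_start = now - timedelta(days=window_days)
    prior_start = now - timedelta(days=2 * window_days)
    recent = sum(1 for e in events if e >= recent_start)
    prior = sum(1 for e in events if e >= prior_start and recent_start > e)
    return recent / prior if prior else float(recent)
```

A CI-triggered re-audit would recompute this on every push, rather than waiting for the next manual audit cycle.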

&lt;p&gt;&lt;strong&gt;The denominator problem.&lt;/strong&gt; Zero of 10 repositories reached T3. This may accurately describe the ecosystem's current state. It may also reflect calibration issues in the upper tiers. Distinguishing between the two requires before-and-after auditing of repositories that have received systematic governance remediation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Stage 2 redistribution question.&lt;/strong&gt; Most audits have no cross-platform consistency data. When that data is unavailable, the framework redistributes Stage 2's weight equally between documentation quality and engineering accountability. &lt;/p&gt;

&lt;p&gt;For repositories with clinical-direct exposure, a well-written README can then compensate for weak code accountability. A guardrail flags this condition. The current redistribution rule is explicit but not yet final — it remains one of the framework's open calibration questions.&lt;/p&gt;
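&lt;p&gt;A sketch of the redistribution rule and its guardrail, with illustrative weights and a made-up flag threshold rather than the framework's published values:&lt;/p&gt;

```python
def stage2_weights(has_consistency_data, stage2_weight=0.30):
    """Split Stage 2's weight when cross-platform data is unavailable.

    stage2_weight is illustrative, not the framework's published value.
    """
    if has_consistency_data:
        return {"consistency": stage2_weight}
    half = stage2_weight / 2
    return {"documentation": half, "engineering": half}

def guardrail(exposure, doc_score, code_score):
    """Flag the condition described above: clinical-direct exposure where
    a strong README masks weak engineering accountability. The 0.3 gap
    threshold is a placeholder, not a calibrated value."""
    return exposure == "clinical-direct" and doc_score - code_score > 0.3
```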




&lt;blockquote&gt;
&lt;p&gt;If there are open-source bio-AI repositories you think should be audited next, drop them in the comments. Bonus if they claim clinical relevance, drug discovery, or medical reasoning.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;STEM-AI v1.1.2 — Trust Evaluation Framework for Medical AI Repositories.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;"Code works. But does the author care about the patient?"&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>bioinformatics</category>
      <category>healthtech</category>
    </item>
    <item>
      <title>Everyone Was Talking About Context Engineering. Nobody Had Solved Governance.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 25 Mar 2026 09:29:42 +0000</pubDate>
      <link>https://dev.to/flamehaven01/everyone-was-talking-about-context-engineering-nobody-had-solved-governance-424j</link>
      <guid>https://dev.to/flamehaven01/everyone-was-talking-about-context-engineering-nobody-had-solved-governance-424j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Disclosure: This article was written by the author with AI assistance for editing. All technical content, architecture decisions, and design rationale are the author's own. #ABotWroteThis&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Fail-Closed Gate&lt;/strong&gt;: An admission rule that excludes a context item if it fails a required threshold — regardless of its score on other dimensions. No exceptions. Introduced in v0.1.7.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;README-as-Protocol&lt;/strong&gt;: The pattern in which an AI session's natural behavior of reading the README first is formalized as the primary invocation mechanism. No installation required. Introduced in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation Protocol&lt;/strong&gt;: The schema-level declaration of how a MICA archive reaches an AI session — and how the session confirms it was loaded. Formalized as a required field in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening report the model must produce at session start to confirm the archive was loaded. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant Entry&lt;/strong&gt;: A structured governance rule with identity, rule text, and severity. Replaced plain string invariants in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state — file existence, hash integrity, and README sync. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Playbook&lt;/strong&gt;: The operator-facing discipline layer that sits outside the schema. The schema enforces structure; the Playbook enforces judgment.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Context Engineering&lt;/strong&gt;: The practice of shaping what the model sees, in what order, with what boundaries, and under what assumptions — not just what you ask it, but what it actually knows at runtime.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;CTX&lt;/strong&gt;: A collection-first context packaging approach that gathers relevant workspace material and delivers it to the model. In this article, CTX represents the collection layer of context engineering — answering, “What does the AI see?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqx6707g8j9hu59d7wiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqx6707g8j9hu59d7wiu.png" alt="Trustworthy context engineering" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What Parts 1 Through 4 Actually Established
&lt;/h2&gt;

&lt;p&gt;The first four parts of this series already narrowed the problem considerably.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/my-llm-kept-forgetting-my-project-so-i-built-a-governance-schema-4bo6"&gt;Part 1&lt;/a&gt; defined the failure mode.&lt;/strong&gt;&lt;br&gt;
The issue was not that long-running AI work needed “more prompt.” The issue was that a model can only act on what it actually knows right now, and most project context systems still treat that as a document problem instead of a governance problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626"&gt;Part 2&lt;/a&gt; established the first hard boundary.&lt;/strong&gt;&lt;br&gt;
A schema can exist, and the model can still have no reliable way to know it exists. That is the difference between a document and a constraint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-stake-was-governance-outside-the-schema-mica-v015-pulled-it-in-46n9"&gt;Part 3&lt;/a&gt; moved governance into the schema.&lt;/strong&gt;&lt;br&gt;
Provenance, deviations, and semantic rules could no longer remain outside the system in READMEs, comments, or team habits. They had to become machine-readable structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9"&gt;Part 4&lt;/a&gt; made that structure operative.&lt;/strong&gt;&lt;br&gt;
The model already treated the README as its natural entry surface. Once that behavior was declared as an invocation protocol, the schema stopped being a passive archive and became a runtime contract.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That progression defines what MICA is actually trying to solve.&lt;/p&gt;

&lt;p&gt;It is not trying to be “better prompting.”&lt;br&gt;&lt;br&gt;
It is not trying to be “more retrieval.”&lt;br&gt;&lt;br&gt;
It is trying to answer a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does governed context reach the model, under declared rules, with confirmable load and auditable change?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the bridge into the broader landscape.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Context Engineering Was Never Just Prompting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1mx3x0xenx6pc80hd16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1mx3x0xenx6pc80hd16.png" alt="The structural gap in context engineering" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One clarification matters before going further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; is not just prompt writing.&lt;/p&gt;

&lt;p&gt;At the broadest level, it is the practice of shaping what the model sees, in what order, with what boundaries, and under what assumptions. Prompts are one part of that. Retrieval is another. File selection, memory handoff, system instructions, and workspace state all belong to the same larger question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the model actually know right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By that definition, MICA is part of context engineering.&lt;/p&gt;

&lt;p&gt;Not because it retrieves context.&lt;br&gt;&lt;br&gt;
Not because it packs more tokens into a window.&lt;br&gt;&lt;br&gt;
But because it governs which context is allowed to shape the session, under what trust conditions, and with what record when those conditions are tested.&lt;/p&gt;

&lt;p&gt;That distinction matters, because most of the field has focused on a different layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. The Conversation Was Already Happening
&lt;/h2&gt;

&lt;p&gt;Context engineering is not a new idea, and MICA does not claim to have invented the conversation.&lt;/p&gt;

&lt;p&gt;The term was amplified by Andrej Karpathy and others, but the underlying practice — designing what the model sees, not just what you ask it — had already been emerging in serious AI work.&lt;/p&gt;

&lt;p&gt;Collection-first tools already existed. CTX is a useful example of that layer: it gathers relevant workspace material and delivers it to the model without manual copy-paste. It answers an important question well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the same time, some of the sharper practitioner writing was already moving beyond collection alone. One such example was an OpenAI Developer Community post by Serge Liatko, &lt;a href="https://community.openai.com/t/prompt-engineering-is-dead-and-context-engineering-is-already-obsolete-why-the-future-is-automated-workflow-architecture-with-llms/1314011" rel="noopener noreferrer"&gt;&lt;strong&gt;“Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete: Why the Future Is Automated Workflow Architecture with LLMs.”&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The value of that piece was not the author's status, but the precision of the problem it named: manually maintained context eventually reaches its ceiling. &lt;/p&gt;

&lt;p&gt;Once system state changes faster than humans can keep context aligned, the real question is no longer just how to collect context, but how to automate its ownership, maintenance, and validation as the system evolves.&lt;/p&gt;

&lt;p&gt;That was an important move forward.&lt;/p&gt;

&lt;p&gt;But one layer still remained underdefined.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. The Missing Layer Was Already Visible
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdr13fr8q5dff4jqhini.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdr13fr8q5dff4jqhini.png" alt="Gathering data vs governing data" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
The same missing layer had already shown up elsewhere.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/flamehaven01/your-agentic-stack-has-two-layers-it-needs-three-3h1"&gt;&lt;strong&gt;Your Agentic Stack Has Two Layers. It Needs Three&lt;/strong&gt;&lt;/a&gt;, I argued that the usual stack had matured around two strong layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP / tool calls — &lt;strong&gt;how&lt;/strong&gt; the agent talks to systems&lt;/li&gt;
&lt;li&gt;agent skills — &lt;strong&gt;what&lt;/strong&gt; the agent can do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But something was still missing above both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the layer that decides &lt;strong&gt;whether&lt;/strong&gt; the agent should do it, under what constraints, and toward what end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the layer of intent, authority, and governance.&lt;/p&gt;

&lt;p&gt;The same problem appears in context engineering.&lt;/p&gt;

&lt;p&gt;A context pipeline can be excellent at retrieval and still be weak at governance. It can gather the right files, summarize the right notes, and deliver the right-looking material — and still fail to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is authoritative?&lt;/li&gt;
&lt;li&gt;what is provisional?&lt;/li&gt;
&lt;li&gt;what must never be violated?&lt;/li&gt;
&lt;li&gt;what changed since last time?&lt;/li&gt;
&lt;li&gt;how does the session prove it loaded the governed archive at all?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not collection questions.&lt;/p&gt;

&lt;p&gt;They are governance questions.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Where CTX Stops and MICA Begins
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye2vmym08cd61jq7vn62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye2vmym08cd61jq7vn62.png" alt="Inside context engineering, beyond collection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CTX solves the collection problem.&lt;/p&gt;

&lt;p&gt;It answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a necessary layer. Without it, context management collapses back into manual copy-paste, repeated explanation, and fragile session startup.&lt;/p&gt;

&lt;p&gt;But it does not answer the next set of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where did this context item come from, and can that claim be verified?&lt;/li&gt;
&lt;li&gt;What happens when a file changes between sessions?&lt;/li&gt;
&lt;li&gt;Which constraints are non-negotiable, and what is the consequence when they are violated?&lt;/li&gt;
&lt;li&gt;Who approved the last change to the archive, and can that decision be audited?&lt;/li&gt;
&lt;li&gt;When the session begins, how does the system confirm the archive was actually loaded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not retrieval questions.&lt;/p&gt;

&lt;p&gt;They are governance questions.&lt;/p&gt;

&lt;p&gt;That is the narrow but consequential difference.&lt;/p&gt;

&lt;p&gt;CTX collects context and delivers it.&lt;br&gt;&lt;br&gt;
MICA governs what trust that context carries, what invariants it must not violate, what happens when it changes, and how the session proves that the governed archive was actually loaded.&lt;/p&gt;

&lt;p&gt;One answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The other answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Under what rules does the AI operate — and what is the record when those rules are tested?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They are different layers. Neither replaces the other.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. The Part That Was Still Open
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufv6hwpt82ohirmltadz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufv6hwpt82ohirmltadz.png" alt="The four gates of the governance layer" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A recurring theme in serious context-engineering discussion is that context cannot remain a hand-curated artifact forever. It has to become a function of system state.&lt;/p&gt;

&lt;p&gt;But that still leaves one question open:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Who owns the specification for each step's input — and how is this versioned, tested, and audited as requirements shift?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MICA is a concrete, working answer to that gap.&lt;/p&gt;

&lt;p&gt;Not the only possible answer. Not necessarily the final one. But a real one.&lt;/p&gt;

&lt;p&gt;Its claim is not that context engineering needed to be invented.&lt;/p&gt;

&lt;p&gt;Its claim is that context engineering still needed a governance layer with at least four properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine-addressable invariants&lt;/li&gt;
&lt;li&gt;versioned and auditable change records&lt;/li&gt;
&lt;li&gt;self-tests against the real project state&lt;/li&gt;
&lt;li&gt;declared invocation with confirmable session load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the layer MICA was built to supply.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. What Governance Actually Means Here
&lt;/h2&gt;

&lt;p&gt;Governance is an overloaded word. In this context it means something specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every context item must declare where it came from in a way that can be checked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auditability&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Changes to the archive are recorded when they happen, not reconstructed later from memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invariant enforcement&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Constraints are not vague README prose. They are structured entries with identity and severity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-testing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The archive is checked against the real project state, not only against its own internal shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation confirmation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The model does not silently ignore the archive. Session start requires a structured acknowledgment that the governed archive was loaded.&lt;/p&gt;

&lt;p&gt;None of these are abstract principles in MICA.&lt;br&gt;&lt;br&gt;
They are structural requirements in a running schema.&lt;/p&gt;
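&lt;p&gt;Two of these requirements, invariant enforcement and self-testing, can be sketched concretely. The field names below are illustrative and may not match MICA's actual schema:&lt;/p&gt;

```python
import hashlib
from pathlib import Path

# A Design Invariant Entry: identity, rule text, severity -- structure,
# not README prose. Field names are illustrative.
INVARIANTS = [
    {"id": "INV-001",
     "rule": "No context item admitted without verifiable provenance",
     "severity": "critical"},
]

def self_test(archive_files):
    """Check the archive against the real project state: does each
    referenced file still exist, and does its hash still match?"""
    failures = []
    for entry in archive_files:
        p = Path(entry["path"])
        if not p.exists():
            failures.append((entry["path"], "missing"))
            continue
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            failures.append((entry["path"], "hash drift"))
    return failures
```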


&lt;h2&gt;
  
  
  8. What This Is Not
&lt;/h2&gt;

&lt;p&gt;It is worth being precise about the boundary.&lt;/p&gt;

&lt;p&gt;MICA does &lt;strong&gt;not&lt;/strong&gt; generate context automatically from the codebase. That is not its job. Collection-first systems already exist, and they are valuable. MICA governs what happens once context has been identified.&lt;/p&gt;

&lt;p&gt;MICA does &lt;strong&gt;not&lt;/strong&gt; replace human judgment. A schema can require structure, audit trail, drift response, and self-tests. It cannot eliminate operator discipline. That is why the boundary between schema and playbook matters.&lt;/p&gt;

&lt;p&gt;MICA is also &lt;strong&gt;not&lt;/strong&gt; a finished system. Parts 1 through 4 of this series were explicit about what each version got wrong, what each version corrected, and what remained unresolved.&lt;/p&gt;

&lt;p&gt;That design history is part of the claim, not an embarrassment to it.&lt;/p&gt;


&lt;h2&gt;
  
  
  9. The Actual Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2byeodafgh0grihwdak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2byeodafgh0grihwdak.png" alt="Operative Governance" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The actual claim is not that MICA solves all of context engineering.&lt;/p&gt;

&lt;p&gt;It is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A small operation can already run a governed AI context system — with verifiable provenance, deviation audit trail, structured invariants, self-testing, and declared invocation — without waiting for future tooling that does not yet exist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That claim is demonstrated by the design history already covered in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt; made scoring implementable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.5&lt;/strong&gt; brought governance structure into the schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.7&lt;/strong&gt; made scoring fail-closed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.8&lt;/strong&gt; made invocation declared and confirmable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.8.1&lt;/strong&gt; clarified the remaining runtime ambiguities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in practice, governance at runtime can look as concrete as this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Gate: PASS (self-tests: 7/7) | Track: A,B
Critical Invariants: 3/3 | Deviations: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a critical check fails, the session does not proceed. That is what governance looks like when it becomes operative.&lt;/p&gt;
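&lt;p&gt;A minimal sketch of that fail-closed gate, producing a report shaped like the session block shown earlier. The field names and report format are assumptions, not MICA's published schema:&lt;/p&gt;

```python
def session_gate(self_tests, critical_invariants):
    """Fail-closed admission: any failed self-test or failed critical
    invariant blocks the session, regardless of other scores.
    Field names are illustrative, not MICA's published schema."""
    passed = sum(1 for t in self_tests if t["ok"])
    critical_ok = all(i["ok"] for i in critical_invariants
                      if i["severity"] == "critical")
    ready = passed == len(self_tests) and critical_ok
    header = "READY" if ready else "BLOCKED"
    gate = "PASS" if ready else "FAIL"
    report = (f"[SESSION {header}]\n"
              f"Gate: {gate} (self-tests: {passed}/{len(self_tests)})")
    return ready, report
```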

&lt;p&gt;What can be said now is narrower and more solid: the gap is real, collection-first systems solve one side of it, and MICA addresses the governance side. The conversation about what comes after context engineering was already happening; MICA is one concrete answer to that part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. What Comes Next
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfq5h3kuzj7haxz1kbni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfq5h3kuzj7haxz1kbni.png" alt="From landscape to concrete operation" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 4 ended with a specific question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where does MICA sit in the context engineering landscape that already existed around it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post is the answer.&lt;/p&gt;

&lt;p&gt;It sits inside context engineering — but not at the collection layer.&lt;/p&gt;

&lt;p&gt;It does not compete with retrieval-first systems by trying to collect more files, pack more tokens, or automate more handoff. It begins after that layer. Its job is to govern what enters the session, what remains authoritative, what drift means, what violations matter, and how the model proves it actually loaded the governed archive at all.&lt;/p&gt;

&lt;p&gt;That is why the answer is narrower than most people expect, and more specific than most framings allow.&lt;/p&gt;

&lt;p&gt;MICA is not “the future of all context engineering.”&lt;br&gt;
It is a governance answer to the part of context engineering that collection alone does not solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 6&lt;/strong&gt; will move back down from landscape to concrete operation.&lt;/p&gt;

&lt;p&gt;It will show what MICA looks like in an actual project context: what a session opening report looks like, what a deviation log entry looks like in practice, and what happens when a self-test flags drift.&lt;/p&gt;

&lt;p&gt;After that comes the harder question: what remains unresolved.&lt;/p&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jw1nvooyqfi7pz0ccdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jw1nvooyqfi7pz0ccdk.png" alt="The named decision" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; Governance is not a layer you add after context engineering works. It is the layer that makes context engineering trustworthy — by declaring what is authoritative, recording what changes, and confirming what the AI actually loaded.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>contextengineering</category>
    </item>
  </channel>
</rss>
