<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kwansub Yun</title>
    <description>The latest articles on DEV Community by Kwansub Yun (@flamehaven01).</description>
    <link>https://dev.to/flamehaven01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508506%2Fe2f9bc29-10d2-41ec-8e77-19b8b5cfd9e9.jpg</url>
      <title>DEV Community: Kwansub Yun</title>
      <link>https://dev.to/flamehaven01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flamehaven01"/>
    <language>en</language>
    <item>
      <title>AI-SLOP Detector v3.5.0 — Every Claim, Verified Against Source Code</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:19:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</guid>
      <description>&lt;p&gt;I published a LinkedIn post about AI-SLOP Detector's self-calibration system and download numbers. Someone asked the reasonable question: &lt;strong&gt;"Can you actually back that up?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Here's the source.&lt;/p&gt;

&lt;p&gt;This isn't a feature announcement. It's a line-by-line audit of seven claims against the actual codebase. Every VERDICT links to a real file and real line numbers. The repo is public — go check it yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What was claimed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every scan is recorded&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeat scans become calibration signal&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Updates only when signal is strong enough&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visible policy artifact (&lt;code&gt;.slopconfig.yaml&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit numeric limits govern calibration&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detects empty/stub/phantom/disconnected code&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~1.4K downloads last week&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All seven. No fabrications. No inflated numbers. Here's the proof.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 1: "Every scan is recorded"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 116–180&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-invoked on every CLI run. The only opt-out is &lt;code&gt;--no-history&lt;/code&gt;. Each scan writes to SQLite at &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt; and stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deficit_score&lt;/code&gt;, &lt;code&gt;ldr_score&lt;/code&gt;, &lt;code&gt;inflation_score&lt;/code&gt;, &lt;code&gt;ddc_usage_ratio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n_critical_patterns&lt;/code&gt;, &lt;code&gt;fired_rules&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git_commit&lt;/code&gt;, &lt;code&gt;git_branch&lt;/code&gt;, &lt;code&gt;project_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The schema is now at v5, auto-migrated on startup, and has carried forward through every release from v2.9.0 to v3.5.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The record() call is real. The schema is versioned. The behavior is not optional.&lt;/strong&gt;&lt;/p&gt;
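&lt;p&gt;To make the shape of this concrete, here is a minimal sketch of an always-on history recorder. The table columns mirror the fields listed above, but the schema and every name here are illustrative, not the actual &lt;code&gt;history.py&lt;/code&gt; implementation (which also stores &lt;code&gt;n_critical_patterns&lt;/code&gt; and &lt;code&gt;fired_rules&lt;/code&gt;).&lt;/p&gt;

```python
import sqlite3

# Illustrative sketch only: the real store lives in
# src/slop_detector/history.py and its schema/signature may differ.
def open_history(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS history (
            file_path TEXT,
            deficit_score REAL,
            ldr_score REAL,
            inflation_score REAL,
            ddc_usage_ratio REAL,
            git_commit TEXT,
            git_branch TEXT,
            project_id TEXT
        )"""
    )
    return conn

def record_scan(conn, file_path, scores,
                git_commit=None, git_branch=None, project_id=None):
    # Invoked on every CLI run unless --no-history is passed.
    conn.execute(
        "INSERT INTO history VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (file_path, scores["deficit"], scores["ldr"],
         scores["inflation"], scores["ddc"],
         git_commit, git_branch, project_id),
    )
    conn.commit()
```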




&lt;h2&gt;
  
  
  Claim 2: "Every re-scan becomes signal"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 221–246&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_files_with_multiple_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Only files scanned &amp;gt;= 2 times count as calibration events
&lt;/span&gt;    &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="n"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;HAVING&lt;/span&gt; &lt;span class="nc"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 301–309&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_load_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;by_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_group_runs_by_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single-scan files produce no calibration events. Only repeat scans generate &lt;code&gt;improvement&lt;/code&gt; or &lt;code&gt;fp_candidate&lt;/code&gt; labels. The threshold is hardcoded in SQL, not assumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The repeat-scan requirement is enforced at the query level, not in documentation.&lt;/strong&gt;&lt;/p&gt;
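&lt;p&gt;A toy re-creation of that gate shows why single-scan files contribute nothing. The table layout below is simplified for illustration; only the &lt;code&gt;HAVING COUNT(*) &gt;= 2&lt;/code&gt; idea is taken from the excerpt.&lt;/p&gt;

```python
import sqlite3

# Toy re-creation of the repeat-scan gate from history.py; the real
# table has more columns, but the HAVING clause is the whole trick.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE history (file_path TEXT)")
conn.executemany(
    "INSERT INTO history VALUES (?)",
    [("a.py",), ("a.py",), ("b.py",)],  # a.py scanned twice, b.py once
)

def count_files_with_multiple_runs(conn):
    # Only files with two or more recorded scans count as events.
    rows = conn.execute(
        "SELECT file_path FROM history "
        "GROUP BY file_path HAVING COUNT(*) >= 2"
    ).fetchall()
    return len(rows)
```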




&lt;h2&gt;
  
  
  Claim 3: "Updates only when the signal is strong enough"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54 (constants) and 251–262 (enforcement)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;   &lt;span class="c1"&gt;# min gap between #1 and #2 candidate
&lt;/span&gt;&lt;span class="n"&gt;MIN_IMPROVEMENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="c1"&gt;# improvement events required
&lt;/span&gt;&lt;span class="n"&gt;MIN_FP_CANDIDATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;# fp_candidate events required
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 1 — confidence gap check (line 251):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence gap &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;lt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Candidates are too close — need more history data for reliable calibration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# NO UPDATE APPLIED
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 2 — score delta check (line 262):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;winner_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# also does not apply
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two independent guards. Both must pass before any weight update applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Ambiguous signal is rejected twice before touching configuration.&lt;/strong&gt;&lt;/p&gt;
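&lt;p&gt;The two gates compose into a small decision function. The constants below are the values quoted above; the function and field names are illustrative, not the real API.&lt;/p&gt;

```python
# Toy version of the two calibration gates; constants match the
# excerpts above, names are illustrative.
CONFIDENCE_GAP = 0.10
MIN_SCORE_DELTA = 0.02

def calibration_status(confidence_gap, current_score, winner_score):
    # Gate 1: the top two candidates must be clearly separated.
    if CONFIDENCE_GAP > confidence_gap:
        return "insufficient_data"
    # Gate 2: the winner must beat the current score by at least 0.02.
    if MIN_SCORE_DELTA > current_score - winner_score:
        return "no_change"
    return "ok"
```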




&lt;h2&gt;
  
  
  Claim 4: "Leaves behind a visible policy every time it changes"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, docstring line 17–18&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Return CalibrationResult&lt;span class="p"&gt;;&lt;/span&gt; optionally write to .slopconfig.yaml via &lt;span class="nt"&gt;--apply-calibration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;--apply-calibration&lt;/code&gt; is passed and &lt;code&gt;status == "ok"&lt;/code&gt;, optimal weights are written to &lt;code&gt;.slopconfig.yaml&lt;/code&gt;. Plain-text YAML. Human-readable. Git-versionable. Every calibration change is a diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The policy artifact is explicit. You can &lt;code&gt;git blame&lt;/code&gt; it.&lt;/strong&gt;&lt;/p&gt;
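&lt;p&gt;For a sense of what such an artifact looks like, here is a minimal renderer. The field names and layout are a sketch, not the actual &lt;code&gt;.slopconfig.yaml&lt;/code&gt; schema.&lt;/p&gt;

```python
# Minimal sketch of the plain-text policy artifact; the real file is
# .slopconfig.yaml and may carry more fields than shown here.
def render_policy(weights):
    lines = ["weights:"]
    for name in sorted(weights):
        lines.append(f"  {name}: {weights[name]:.2f}")
    return "\n".join(lines) + "\n"

policy_text = render_policy(
    {"ldr": 0.40, "inflation": 0.20, "ddc": 0.25, "purity": 0.15}
)
```

&lt;p&gt;Because it is plain text, every calibration change shows up as an ordinary line diff in version control.&lt;/p&gt;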




&lt;h2&gt;
  
  
  Claim 5: "Explicit limits govern calibration"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;             &lt;span class="c1"&gt;# minimum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;             &lt;span class="c1"&gt;# maximum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_PURITY_WEIGHT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# purity ceiling
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_TOLERANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;  &lt;span class="c1"&gt;# max per-dimension deviation from domain anchor
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_DRIFT_LIMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# warn when optimal weight drifts this far
&lt;/span&gt;&lt;span class="n"&gt;GRID_STEP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;             &lt;span class="c1"&gt;# 0.05 increment resolution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No ML model. No learned bounds. Every constraint is a named constant with a comment explaining why it exists. The calibration space is a bounded grid, not an open optimization landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Every limit is auditable. Nothing is opaque.&lt;/strong&gt;&lt;/p&gt;
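&lt;p&gt;As a sketch of how such bounds act on a candidate weight vector: clamp each dimension into its named range, then renormalize. The renormalization step is an assumption of this sketch, not a quote from &lt;code&gt;self_calibrator.py&lt;/code&gt;.&lt;/p&gt;

```python
# Sketch of enforcing the bounds above on a candidate weight vector.
MIN_W, MAX_W = 0.10, 0.65
MAX_PURITY_WEIGHT = 0.25

def clamp_weights(weights):
    clamped = {}
    for name, w in weights.items():
        # Purity has its own, lower ceiling.
        hi = MAX_PURITY_WEIGHT if name == "purity" else MAX_W
        clamped[name] = min(max(w, MIN_W), hi)
    total = sum(clamped.values())
    # Keep the clamped vector a valid weighting (sums to 1.0).
    return {name: w / total for name, w in clamped.items()}
```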




&lt;h2&gt;
  
  
  Claim 6: "Detects empty implementations, phantom dependencies, disconnected pipelines"
&lt;/h2&gt;

&lt;p&gt;These are the three canonical defect patterns AI code generation produces at scale. Each has a dedicated module.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defect class&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Empty/stub functions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ldr.py&lt;/code&gt; — LDRCalculator detects &lt;code&gt;pass&lt;/code&gt;, &lt;code&gt;...&lt;/code&gt;, &lt;code&gt;raise NotImplementedError&lt;/code&gt;, &lt;code&gt;TODO&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phantom/unused imports&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/hallucination_deps.py&lt;/code&gt; — AST-based import vs usage analysis via &lt;code&gt;HallucinatedDependency&lt;/code&gt; dataclass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disconnected pipelines&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ddc.py&lt;/code&gt; — DDC (Declared Dependency Completeness) usage ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function clone clusters&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/patterns/python_advanced.py&lt;/code&gt; — Jensen-Shannon Divergence on 30-dim AST histograms, JSD &amp;lt; 0.05 = clone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The clone detection is worth noting. JSD on AST histograms catches structural duplication that string similarity misses entirely. LLMs produce a lot of this — same function logic, slightly renamed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Each defect class has a named module with a working implementation.&lt;/strong&gt;&lt;/p&gt;
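&lt;p&gt;To see why JSD on AST histograms catches renamed clones, here is a toy version of the idea. The shipped detector uses fixed 30-dimensional histograms and a 0.05 threshold; this sketch just counts whatever node types appear.&lt;/p&gt;

```python
import ast
import math
from collections import Counter

# Illustrative clone check in the spirit of python_advanced.py:
# compare AST-node-type histograms with Jensen-Shannon divergence.
def node_histogram(source):
    counts = Counter(type(n).__name__ for n in ast.walk(ast.parse(source)))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def jsd(p, q):
    keys = set(p) | set(q)
    m = {k: (p.get(k, 0) + q.get(k, 0)) / 2 for k in keys}
    def kl(a):
        return sum(a.get(k, 0) * math.log2(a.get(k, 0) / m[k])
                   for k in keys if a.get(k, 0) > 0)
    return (kl(p) + kl(q)) / 2

# Same logic, renamed identifiers: string similarity drops, but the
# node-type histogram (and therefore the JSD) stays identical.
a = node_histogram("def f(x):\n    return x + 1\n")
b = node_histogram("def quality_helper(value):\n    return value + 1\n")
```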




&lt;h2&gt;
  
  
  Claim 7: "~1.4K downloads in the past week"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: pypistats.org API (&lt;code&gt;mirrors=false&lt;/code&gt;), queried 2026-04-15&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;last_week&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;1,407  (mirrors excluded — actual pip install traffic)&lt;/span&gt;
&lt;span class="na"&gt;last_month&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1,787&lt;/span&gt;
&lt;span class="na"&gt;last_day&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"~1.4K" is within 0.5% of 1,407. Mirrors excluded means bot traffic is stripped — these are real install invocations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Verified against pypistats in real time. The number is not rounded up.&lt;/strong&gt;&lt;/p&gt;
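&lt;p&gt;The rounding claim itself is one line of arithmetic:&lt;/p&gt;

```python
# Checking the rounding claim: "~1.4K" against the exact figure.
claimed = 1400   # "~1.4K"
exact = 1407     # pypistats last_week, mirrors excluded
relative_error = abs(exact - claimed) / exact  # just under 0.005
```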




&lt;h2&gt;
  
  
  Why this format exists
&lt;/h2&gt;

&lt;p&gt;Most open-source project posts make claims. Few back them up with file paths and line numbers.&lt;/p&gt;

&lt;p&gt;That gap is the same problem AI-SLOP Detector is built to close. AI-generated code makes claims too — functions that look complete, imports that look used, pipelines that look connected. Static analysis finds the gap between what the code says and what it does.&lt;/p&gt;

&lt;p&gt;This post applies the same standard to the project's own marketing copy. If a claim can be verified, it should be. If it can't, it shouldn't be made.&lt;/p&gt;

&lt;p&gt;The codebase is public: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;github.com/flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pull requests welcome. Audits welcome more.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verified by static code analysis + pypistats API, 2026-04-15&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitools</category>
      <category>opensource</category>
      <category>codequality</category>
      <category>python</category>
    </item>
    <item>
      <title>It Gets Smarter Every Scan: AI-SLOP Detector v3.5.0 and the Self-Calibration Loop</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:15:49 +0000</pubDate>
      <link>https://dev.to/flamehaven01/it-gets-smarter-every-scan-ai-slop-detector-v350-and-the-self-calibration-loop-3fia</link>
      <guid>https://dev.to/flamehaven01/it-gets-smarter-every-scan-ai-slop-detector-v350-and-the-self-calibration-loop-3fia</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ombaq79ho65mgbtjqyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ombaq79ho65mgbtjqyg.png" alt="cover" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; 🔻&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n"&gt;v3.1.0 — Three Formula Refinements and the Adversarial Tester That Found Them&lt;/a&gt; · &lt;br&gt;
🔻&lt;a href="https://dev.to/flamehaven01/the-tool-that-turned-on-itself-ai-slop-detector-v290-v291-3oc4"&gt;v2.9.0/v2.9.1 — The Tool That Turned On Itself&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By late 2025, everyone was building with AI. A weekend was enough to launch a SaaS app, and by Monday it was already on Product Hunt. The code looked finished, the UI worked, and the demo landed. That was also the problem.&lt;/p&gt;

&lt;p&gt;In 2026, some of the consequences started arriving in public. Exposed databases, weak security boundaries, brittle automation, and production systems that looked polished enough to ship but had clearly not been understood at the level their surface confidence implied. Not every one of those failures belongs to static analysis, and it would be too easy to pretend otherwise. But many of them still point to the same upstream condition: code that looks complete long before it deserves trust.&lt;/p&gt;

&lt;p&gt;That is the layer this release is about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The breach is the headline. The review gap is the story.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ec93in1avitflyntlvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ec93in1avitflyntlvp.png" alt="Structurally plausible, functionally thin" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A missing security rule is not the same thing as a stubbed auth function. A runtime-only bug is not the same thing as a phantom import. A broken architecture is not the same thing as a buzzword-heavy helper. These are different failure classes, and any serious tool has to respect that difference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag337gfzcfw1zdjiancq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag337gfzcfw1zdjiancq.png" alt="output scales while oversight stagnates" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What they often share, though, is the review environment that let them through. AI increased output volume, increased speed, and increased surface polish. Review depth did not increase with it. That matters because AI-generated code has a very recognizable habit: it often looks complete before it is complete.&lt;/p&gt;

&lt;p&gt;It compiles. It passes tests. It sounds like it knows what it is doing. Then you open the function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_quality_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Advanced multi-dimensional quality assessment using
    proprietary algorithms with statistical normalization,
    entropy-based weighting, and dynamic threshold calibration.
    Returns a score between 0 and 100.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# TODO: implement the actual algorithm
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;85.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not noisy code. It is confident emptiness. In an analytics path, it becomes false certainty. In a payment path, it becomes a defect. In an auth path, it becomes risk. &lt;/p&gt;

&lt;p&gt;The issue is not that AI writes ugly code. The issue is that AI reliably produces code that is structurally plausible while functionally thin.&lt;/p&gt;

&lt;p&gt;That is a narrower claim than “AI is dangerous,” but it is also far more useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  We ran into this ourselves
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc91ne3b4usfndqmn5p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc91ne3b4usfndqmn5p0.png" alt="4-dimensional weighted geometric mean" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This did not begin as a theory about other people’s repos. It began when we found a flaw in our own scoring model. Back in v2.8.0, we discovered that our formula was accidentally rewarding spaghetti code: a large god function could sometimes look healthier than a small clean function, because complexity was dividing the penalty instead of amplifying it.&lt;/p&gt;

&lt;p&gt;That was backwards, so the math changed.&lt;/p&gt;
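&lt;p&gt;The direction of that bug is easy to show with hypothetical numbers. This is not the actual scoring formula, only the sign of the error and its fix.&lt;/p&gt;

```python
# Hypothetical numbers only: this shows the direction of the v2.8.0
# bug, not the real scoring formula.
def buggy_penalty(base_penalty, complexity):
    return base_penalty / complexity   # old: complexity DIVIDED the penalty

def fixed_penalty(base_penalty, complexity):
    return base_penalty * complexity   # new: complexity AMPLIFIES it

god_function = (0.3, 12)  # high-complexity god function
small_clean = (0.3, 2)    # small, clean function
```

&lt;p&gt;Under the old math the god function received the smaller penalty, which is exactly backwards.&lt;/p&gt;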

&lt;p&gt;AI-SLOP Detector now evaluates four dimensions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LDR&lt;/strong&gt; for logic density
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflation&lt;/strong&gt; for jargon density relative to real logic
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDC&lt;/strong&gt; for dependency usage rather than dependency presence
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purity&lt;/strong&gt; for critical structural defects that should drag the whole score down
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are combined with a &lt;strong&gt;weighted geometric mean&lt;/strong&gt;, not an arithmetic average.&lt;/p&gt;

&lt;p&gt;Why that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one strong-looking axis should not be able to hide a collapsed one
&lt;/li&gt;
&lt;li&gt;a polished docstring should not rescue empty logic
&lt;/li&gt;
&lt;li&gt;if one important dimension fails, the whole score should feel it&lt;/li&gt;
&lt;/ul&gt;
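&lt;p&gt;A small numeric sketch makes the difference concrete. The weights below are illustrative, not the shipped defaults; the point is only how the two means treat one collapsed dimension.&lt;/p&gt;

```python
import math

# One collapsed dimension (LDR at 0.05) drags the geometric mean far
# below what a weighted arithmetic mean would report.
def weighted_geometric_mean(scores, weights):
    total_w = sum(weights.values())
    return math.prod(scores[k] ** (weights[k] / total_w) for k in scores)

weights = {"ldr": 0.40, "inflation": 0.20, "ddc": 0.25, "purity": 0.15}
hollow = {"ldr": 0.05, "inflation": 0.90, "ddc": 0.90, "purity": 0.90}

geo = weighted_geometric_mean(hollow, weights)        # roughly 0.28
arith = sum(weights[k] * hollow[k] for k in hollow)   # 0.56
```

&lt;p&gt;The arithmetic mean lets three polished axes rescue the empty one; the geometric mean does not.&lt;/p&gt;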

&lt;p&gt;That is the scoring philosophy underneath the tool. But even that was not enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Static analyzers have a threshold problem
&lt;/h2&gt;

&lt;p&gt;Take a perfectly legitimate ML helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_training_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;raw_samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PreTrainedTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Tokenize and pad samples for transformer training.
    Handles attention mask generation and HuggingFace
    tokenizer conventions for batch encoding.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is nothing wrong with this code. But a generic detector may still overreact, because terms like &lt;code&gt;tokenizer&lt;/code&gt;, &lt;code&gt;attention mask&lt;/code&gt;, and &lt;code&gt;HuggingFace&lt;/code&gt; can look suspicious if the analyzer does not understand the domain it is scanning. In a real ML codebase, those terms are normal. In a CRUD backend, some of them may be genuine anomaly signals.&lt;/p&gt;

&lt;p&gt;That is the threshold problem. The same threshold can be wrong in one codebase and exactly right in another. A universal threshold sounds elegant, but real repositories are local. They have habits, idioms, and boilerplate that are legitimate inside one domain and suspicious inside another.&lt;/p&gt;

&lt;p&gt;So the next problem became obvious: the tool had to learn the project it was scanning. That is the real center of v3.5.0.&lt;/p&gt;
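&lt;p&gt;The domain-dependence described above can be sketched in a few lines. This is an illustrative toy model; &lt;code&gt;DOMAIN_BASELINES&lt;/code&gt; and &lt;code&gt;jargon_penalty&lt;/code&gt; are hypothetical names, not the detector's real API.&lt;/p&gt;

```python
# Hypothetical sketch: the same term scores differently per domain.
DOMAIN_BASELINES = {
    "ml": {"tokenizer", "attention mask", "embedding"},
    "crud_backend": {"serializer", "migration", "endpoint"},
}

def jargon_penalty(term: str, domain: str) -> float:
    """Terms that are normal in the detected domain carry no penalty;
    out-of-domain jargon contributes a weak anomaly signal."""
    baseline = DOMAIN_BASELINES.get(domain, set())
    return 0.0 if term in baseline else 0.5
```

&lt;p&gt;Under this toy model, &lt;code&gt;tokenizer&lt;/code&gt; is free in an ML repository and a mild signal in a CRUD backend, which is exactly the asymmetry a universal threshold cannot express.&lt;/p&gt;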




&lt;h2&gt;
  
  
  What AI-SLOP Detector actually does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1e70mgohgkb3xwldop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1e70mgohgkb3xwldop.png" alt="scanning for structrural integrity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP Detector&lt;/a&gt; is a static analyzer built to catch a specific defect class that shows up repeatedly in AI-generated code: unimplemented stubs, disconnected pipelines, phantom imports, clone-shaped emptiness, placeholder-heavy production paths, and jargon inflation that outruns the actual logic. It is not a style linter, not a full security scanner, and not a runtime verifier. It is a detector for &lt;strong&gt;structural hollowness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That distinction matters because it keeps the claim honest. The tool is not trying to solve every production risk. It is trying to catch one layer that becomes more expensive as AI output scales faster than human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector
slop-detector &lt;span class="nt"&gt;--init&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The workflow is the product story
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ilk1rr0i2hozydca4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ilk1rr0i2hozydca4o.png" alt="why universal rules fail real repositories" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this release interesting is that it is not just “more patterns” or “more language support.” It is a workflow story.&lt;/p&gt;

&lt;p&gt;The detector now has a real loop. It scans the file, classifies its role, computes the 4D score, applies structural pattern penalties, and writes the result to history. Then, once enough repeated scans exist, it revisits that history, extracts behavioral signals, tunes the weights inside bounded domain-aware limits, updates the configuration, and keeps scanning. That is the release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fq4x5gkdfp5mwe5nfcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fq4x5gkdfp5mwe5nfcl.png" alt="mermaid1" width="800" height="1345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That final stretch is what changed this from “detector upgrade” into “adaptive detector.” The tool no longer only evaluates code. It also learns from what happens after evaluation.&lt;/p&gt;
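&lt;p&gt;As a rough outline, the loop described above might look like this. The analyzer functions and the history schema here are stand-ins for illustration, not the tool's actual internals.&lt;/p&gt;

```python
import sqlite3

# Minimal stand-ins for the real analyzers (illustrative only).
def classify_role(path):
    return "test" if "test" in path else "core"

def four_d_score(source, role):
    # Toy scoring: penalize files containing placeholder markers.
    return 1.0 if "TODO" in source else 0.0

def scan_and_record(path, source, db):
    """Scan one file, then write the result to the local history DB."""
    role = classify_role(path)
    score = four_d_score(source, role)
    db.execute(
        "INSERT INTO scan_history(path, role, score) VALUES (?, ?, ?)",
        (path, role, score),
    )
    return score

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE scan_history(path TEXT, role TEXT, score REAL)")
scan_and_record("app/core.py", "def f():\n    # TODO\n    pass\n", db)
```

&lt;p&gt;The history table is the point: every scan leaves a row behind, and those rows are what the calibration step later mines.&lt;/p&gt;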




&lt;h2&gt;
  
  
  Self-calibration is the real headline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql1dkuxvotgmm6bnzc4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql1dkuxvotgmm6bnzc4x.png" alt="Mechanical self-calibration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every scan is recorded to a local SQLite history database. That history is not just there for reporting. It becomes the signal surface for the next tuning step. Once enough repeated scans accumulate, the detector begins asking a simple question: when this file was flagged, what happened next?&lt;/p&gt;

&lt;p&gt;That produces two behavior-derived event types. An &lt;strong&gt;improvement event&lt;/strong&gt; means the file was flagged, later changed, and its deficit dropped meaningfully. A &lt;strong&gt;false-positive candidate&lt;/strong&gt; means the file was flagged, then scanned again with the same content and little meaningful score movement.&lt;/p&gt;

&lt;p&gt;That difference is more important than it sounds. A lot of “self-improving” systems quietly learn from their own outputs. They mark something suspicious, then later use that same judgment as the truth signal for tuning. The system becomes better at agreeing with itself. That is not calibration. That is self-imitation with cleaner packaging.&lt;/p&gt;

&lt;p&gt;v3.5.0 tries to avoid that trap. Its labels are not taken from the scoring formula. They are inferred from developer behavior around repeated scans. The formula says, “this looks suspicious.” The next run reveals whether a human treated that suspicion as real.&lt;/p&gt;

&lt;p&gt;That signal is not perfect. An unchanged file is not always a false positive. It may be legacy code, low priority, or simply out of scope. But it is still a healthier signal than teaching the formula to imitate its own prior outputs.&lt;/p&gt;
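&lt;p&gt;Concretely, the two event types can be derived from nothing more than a content hash and a score per scan. The thresholds below are invented for this sketch.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Scan:
    content_hash: str
    score: float

def label_event(flagged: Scan, rescan: Scan,
                drop_threshold: float = 0.2, noise: float = 0.05) -> str:
    """Classify what happened between a flagged scan and the next one."""
    changed = rescan.content_hash != flagged.content_hash
    if changed and flagged.score - rescan.score >= drop_threshold:
        return "improvement"               # flagged, edited, deficit dropped
    if not changed and abs(flagged.score - rescan.score) < noise:
        return "false_positive_candidate"  # flagged, untouched, score flat
    return "inconclusive"
```

&lt;p&gt;Note that the label comes from what the developer did between scans, not from the scoring formula itself.&lt;/p&gt;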




&lt;h2&gt;
  
  
  What the loop actually looks like
&lt;/h2&gt;

&lt;p&gt;The loop is not mystical. It is mechanical. Repeated scans accumulate, improvement and likely-FP events are extracted, candidate weight sets are evaluated, the search is bounded around the project’s current domain anchor, and if a strong enough winner appears, the config gets updated. If a calibrated weight drifts too far from the domain anchor, the system emits a warning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfpv792zuufs9nm0czgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfpv792zuufs9nm0czgz.png" alt="mermaid2" width="800" height="2640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what makes the title true. It gets smarter every scan, not because a hidden model is hallucinating taste, but because repeated use creates a bounded feedback loop. That is much less magical, and much more trustworthy.&lt;/p&gt;
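&lt;p&gt;The bounded search is the part worth seeing in code. In this sketch, candidate weights are clipped to a band around the domain anchor, and drift past a softer limit raises a warning; every name and bound here is an assumption for illustration.&lt;/p&gt;

```python
import warnings

def calibrate(anchor, candidates, fitness, bound=0.15, warn_at=0.10):
    """Pick the best candidate weight, but never leave the anchor band."""
    best, best_fit = anchor, fitness(anchor)
    for w in candidates:
        # Clip each candidate into [anchor - bound, anchor + bound].
        clipped = max(anchor - bound, min(anchor + bound, w))
        f = fitness(clipped)
        if f > best_fit:
            best, best_fit = clipped, f
    if abs(best - anchor) > warn_at:
        warnings.warn("calibrated weight drifting far from domain anchor")
    return best
```

&lt;p&gt;Even a candidate of 0.9 cannot pull a 0.5 anchor past 0.65, and with no history at all the anchor simply wins. That is the shape of bounded adaptation: the loop can learn, but it cannot wander.&lt;/p&gt;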




&lt;h2&gt;
  
  
  Why &lt;code&gt;--init&lt;/code&gt; matters more now
&lt;/h2&gt;

&lt;p&gt;There is another reason the calibration story works better in v3.5.0. The detector no longer starts from a generic nowhere. &lt;code&gt;--init&lt;/code&gt; now performs domain-aware bootstrap, detects the likely project type, and seeds the starting weights accordingly. That means calibration starts near the right neighborhood instead of wandering across the whole map.&lt;/p&gt;

&lt;p&gt;That improves the first week of use, not just the tenth. And that matters, because bad first impressions kill adaptive tools. If the detector is only smart after a month of annoying you, it will never survive long enough to get smart.&lt;/p&gt;

&lt;p&gt;Good initialization is not a convenience feature. It is part of whether the loop can gather clean signal at all.&lt;/p&gt;
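&lt;p&gt;A domain-aware bootstrap can be as plain as inspecting dependency files and seeding from a per-domain table. The detection heuristics and seed values below are invented for this sketch.&lt;/p&gt;

```python
# Hypothetical seeds per detected domain (not the tool's real values).
DOMAIN_SEEDS = {
    "ml": {"jargon_weight": 0.2, "stub_weight": 1.0},
    "web_backend": {"jargon_weight": 0.6, "stub_weight": 0.8},
    "generic": {"jargon_weight": 0.4, "stub_weight": 0.9},
}

def detect_domain(requirements: str) -> str:
    """Guess the project type from its dependency list."""
    if any(pkg in requirements for pkg in ("torch", "transformers")):
        return "ml"
    if any(pkg in requirements for pkg in ("django", "flask", "fastapi")):
        return "web_backend"
    return "generic"

def init_config(requirements: str) -> dict:
    return dict(DOMAIN_SEEDS[detect_domain(requirements)])
```

&lt;p&gt;Starting from a seeded neighborhood means the first calibration passes refine a plausible configuration instead of rescuing a generic one.&lt;/p&gt;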




&lt;h2&gt;
  
  
  JS, TS, and Go are not side quests
&lt;/h2&gt;

&lt;p&gt;v3.5.0 also expands analysis coverage to Go, JS, JSX, TS, and TSX. That is useful on its own, but the deeper significance is architectural. Structurally hollow AI-generated code is not a Python-only phenomenon. If the detector’s long-term direction is project-local calibration rather than one-size-fits-all scoring, then wider language support is not a side feature. It is the natural expansion of the same idea.&lt;/p&gt;

&lt;p&gt;Different languages. Same review gap. Same loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest boundary
&lt;/h2&gt;

&lt;p&gt;This tool still does &lt;strong&gt;not&lt;/strong&gt; close every gap. It will not fix missing infrastructure controls, catch every runtime bug, prove the architecture is correct, or replace security review. A clean structural profile is not proof of safety.&lt;/p&gt;

&lt;p&gt;What it can do is narrow one expensive blind spot: the distance between code that &lt;strong&gt;looks finished&lt;/strong&gt; and code that carries enough actual logic to deserve confidence. That is a smaller claim than “AI risk solved,” but it is also the kind of claim that survives production better.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;AI has made software generation dramatically cheaper. It has not made understanding cheaper. That difference is where governance debt begins to accumulate.&lt;/p&gt;

&lt;p&gt;If teams can now generate far more code than they can truly review, then the review stack needs tools that operate below style and above syntax. Not tools that ask whether the code is pretty, but tools that ask whether the implementation carries enough substance for the confidence wrapped around it.&lt;/p&gt;

&lt;p&gt;That is the space AI-SLOP Detector is trying to occupy. Not the whole problem. Just one layer that became impossible to ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project/
slop-detector &lt;span class="nt"&gt;--init&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix what clearly deserves fixing. Leave legitimate idioms alone. Then keep scanning. If the loop is doing its job, the next pass should know your codebase a little better than the first one did.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub: flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtool</category>
      <category>architecture</category>
      <category>ai</category>
      <category>governance</category>
    </item>
    <item>
      <title>Can AI Review Physics? Yes — That Is Why We Built SPAR</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Sun, 12 Apr 2026 09:34:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/can-ai-review-physics-yes-that-is-why-we-built-spar-1ojk</link>
      <guid>https://dev.to/flamehaven01/can-ai-review-physics-yes-that-is-why-we-built-spar-1ojk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A standalone review framework for checking whether outputs deserve the claims attached to them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most review systems answer a familiar question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did the system still produce the expected output?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SPAR is built for a narrower and more dangerous one:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does the output still deserve the claim attached to it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is the split. Not reliability alone, but &lt;strong&gt;admissibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practical terms, admissibility means &lt;strong&gt;claim-worthiness&lt;/strong&gt;: whether a result justifies the interpretation, governance status, or scientific statement built on top of it.&lt;/p&gt;

&lt;p&gt;A system can be reliable and inadmissible at the same time.&lt;/p&gt;

&lt;p&gt;A physics engine can compute &lt;code&gt;beta_G_norm&lt;/code&gt;, return zero, pass regression, and stay green across the whole pipeline. The report can still say the background is admissible. But if the function producing &lt;code&gt;beta_G_norm&lt;/code&gt; is a stub that always returns zero, the output is stable while the claim attached to it is false.&lt;/p&gt;

&lt;p&gt;That is not hypothetical. It is one concrete class of review failure SPAR was designed to surface.&lt;/p&gt;
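&lt;p&gt;The failure is easiest to see as runnable code. This is a deliberately minimal reproduction, not SPAR's actual test fixture.&lt;/p&gt;

```python
def beta_G_norm(background):
    # STUB: the real computation was never implemented.
    return 0.0

def test_regression():
    # Contract: flat Minkowski backgrounds must give vanishing residuals.
    assert beta_G_norm("minkowski") == 0.0

test_regression()  # green

# The same stub is also "correct" for a curved background, where zero
# is not a computed result at all: stable output, inadmissible claim.
assert beta_G_norm("de_sitter") == 0.0
```

&lt;p&gt;Every line above passes. Reliability is intact; admissibility is not.&lt;/p&gt;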




&lt;h2&gt;
  
  
  What SPAR Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F140n2kz9fwvslapkmc6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F140n2kz9fwvslapkmc6m.png" alt="What SPAR Is" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPAR (Sovereign Physics Autonomous Review)&lt;/strong&gt; is a deterministic framework for &lt;strong&gt;claim-aware review&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does not replace unit tests. It does not replace regression benchmarks. It does not replace scoring systems. It reviews a different object:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the output&lt;/li&gt;
&lt;li&gt;the claim attached to that output&lt;/li&gt;
&lt;li&gt;the implementation state behind it&lt;/li&gt;
&lt;li&gt;the maturity state that should travel with it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SPAR started inside Flamehaven-TOE, an open physics simulation and AI governance engine. It has since been extracted into a &lt;strong&gt;standalone open-source framework&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/flamehaven01/SPAR-Framework" rel="noopener noreferrer"&gt;github.com/flamehaven01/SPAR-Framework&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The framework includes a generic review kernel, explicit score and verdict policy, registry-backed review surfaces, and a first domain adapter for physics, the domain where this review model was first stress-tested. Physics is not the limit of the framework.&lt;/p&gt;

&lt;p&gt;The core idea is simpler than the name: &lt;strong&gt;an output can pass while the claim attached to it drifts.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Ordinary Review Is Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke30jyn0seb4vvlwpp0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke30jyn0seb4vvlwpp0i.png" alt="Reliability ≠ Truth" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ordinary review is usually shallow by necessity. It asks questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did the code run&lt;/li&gt;
&lt;li&gt;did the output shape stay valid&lt;/li&gt;
&lt;li&gt;did the score remain within bounds&lt;/li&gt;
&lt;li&gt;did regression remain green&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are necessary checks. They are not always enough.&lt;/p&gt;

&lt;p&gt;The failure SPAR cares about is not, in the first instance, a crash. It is not even always a wrong number. It is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the code executes&lt;/li&gt;
&lt;li&gt;the output looks plausible&lt;/li&gt;
&lt;li&gt;the tests pass&lt;/li&gt;
&lt;li&gt;and the interpretation is still overstated or structurally false&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That failure can appear in several ways: a placeholder implementation returns stable-looking values; a maturity registry stays stale after the implementation improves; a score looks smooth before its epistemic basis is strong enough to justify the interpretation attached to it; an approximation gets reported as closure.&lt;/p&gt;

&lt;p&gt;None of these failures is spectacular. That is exactly why they are easy to miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Minimal Divergence
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe774mdtlkafs2zktlob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe774mdtlkafs2zktlob.png" alt="Ordinary Review vs SPAR" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The clearest way to see the difference is in review form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ordinary_regression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kernel_exec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_suite_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GREEN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spar_review"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONSISTENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONSISTENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GAP_STATE_MISMATCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implementation path is genuine; registry classification remains stale"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required_action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RECLASSIFY"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ordinary regression says the system still works. SPAR says the system may no longer be describing its own computation truthfully.&lt;/p&gt;

&lt;p&gt;SPAR is not "tests, but harsher." It can produce a &lt;strong&gt;different review outcome&lt;/strong&gt; even when ordinary regression remains green. In this case, the required action is not rejection. It is reclassification. That is not the same as testing harder. It is reviewing a different object.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mismatch Classes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bx0i61pzmsgs7ad123q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bx0i61pzmsgs7ad123q.png" alt="What We Catch" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPAR treats three mismatch classes as first-class review objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor mismatch&lt;/strong&gt; — the output conflicts with a declared analytical or contractual anchor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpretation mismatch&lt;/strong&gt; — the report language claims more than the implementation state justifies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maturity mismatch&lt;/strong&gt; — the implementation, registry, and outward-facing claim have drifted out of sync.&lt;/p&gt;

&lt;p&gt;Ordinary review mostly checks whether a system still passes. SPAR checks whether the result is still being described honestly.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three-Layer Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6awaygug9q9gpiivqu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6awaygug9q9gpiivqu4.png" alt="Deterministic, Not LLM-Judged" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer A — Anchor Consistency
&lt;/h3&gt;

&lt;p&gt;Layer A checks whether output agrees with a declared analytical or contractual anchor. The expected value is not "whatever the engine produced last time." It is "what the declared contract says must appear for this background, under this formulation."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Flat Minkowski: beta residuals must vanish
&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CONSISTENT&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;beta_G&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;ANOMALY&lt;/span&gt;

&lt;span class="c1"&gt;# de Sitter: admissibility gate must FAIL
# A PASS here indicates a gate defect, not a success.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Layer A tests agreement with a declared reference contract — not truth in some unconstrained universal sense. Analytical anchors depend on regime, normalization, and formulation. That distinction matters. Still, the engineering value is clear: reliability can remain intact while anchor-consistency fails. A Layer A anomaly means the output contradicts the contract the system claims to be using.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer B — Interpretation Validity
&lt;/h3&gt;

&lt;p&gt;Layer B checks whether the interpretation attached to the output stays within declared scope. This layer is &lt;strong&gt;deterministic&lt;/strong&gt; — it does not rely on a free-form LLM judge. It uses explicit rule tables over structured runtime artifacts, maturity states, and report text.&lt;/p&gt;

&lt;p&gt;Typical checks: does the report claim full closure while the path is still heuristic or partial; is a bounded approximation being described as exact; is an environment-conditional bridge being written up as universal; are overclaim phrases appearing where runtime state does not support them.&lt;/p&gt;

&lt;p&gt;Layer B does not eliminate semantic ambiguity. What it does is narrow the problem from "solve rhetoric in general" to "enforce explicit admissibility contracts against declared model states." That makes it auditable. Not complete. Auditable.&lt;/p&gt;
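&lt;p&gt;A rule table of that kind can be very small. The phrases and maturity states below are illustrative assumptions, but the mechanism (a deterministic lookup, no LLM judge) is the point.&lt;/p&gt;

```python
# Overclaim phrases and the maturity states that would license them.
OVERCLAIM_RULES = {
    "full closure": {"closed"},
    "exact": {"closed"},
    "universal": {"closed"},
}

def check_report(report_text: str, maturity: str) -> list:
    """Flag claim language that the declared maturity state cannot support."""
    text = report_text.lower()
    return [
        f"'{phrase}' not supported by state '{maturity}'"
        for phrase, allowed in OVERCLAIM_RULES.items()
        if phrase in text and maturity not in allowed
    ]
```

&lt;p&gt;The same sentence passes or fails depending on the declared model state, which is exactly what makes the check auditable.&lt;/p&gt;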

&lt;h3&gt;
  
  
  Layer C — Existence and Maturity Probes
&lt;/h3&gt;

&lt;p&gt;Layer C asks what kind of implementation produced the result: genuine, approximate, gapped, environment-conditional, or research-only.&lt;/p&gt;

&lt;p&gt;This is where SPAR becomes especially different from ordinary review. It does not merely score outputs. It checks the &lt;strong&gt;ontological status&lt;/strong&gt; of the path that produced them. A result from a known-limited path is not the same thing as a result from a genuine path. A research probe is not production-grade closure. A dependency-bound bridge is not a universal capability. Those distinctions change what the output is allowed to claim.&lt;/p&gt;
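&lt;p&gt;One way to make that distinction concrete is a claim ladder keyed on maturity. The enum values mirror SPAR's maturity labels; the ladder itself is an assumption of this sketch, not SPAR's actual policy.&lt;/p&gt;

```python
from enum import Enum

class Maturity(Enum):
    OPEN = "open"
    RESEARCH_ONLY = "research_only"
    PARTIAL = "partial"
    ENVIRONMENT_CONDITIONAL = "environment_conditional"
    CLOSED = "closed"

# Strongest claim an output from each path is allowed to carry.
MAX_CLAIM = {
    Maturity.OPEN: "no claim",
    Maturity.RESEARCH_ONLY: "research observation",
    Maturity.PARTIAL: "bounded approximation",
    Maturity.ENVIRONMENT_CONDITIONAL: "conditional result",
    Maturity.CLOSED: "closure",
}
```

&lt;p&gt;Under a ladder like this, a result from a &lt;code&gt;partial&lt;/code&gt; path simply has no way to be written up as closure.&lt;/p&gt;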




&lt;h2&gt;
  
  
  Why the Registry Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtrpjd5figmbutwl658a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtrpjd5figmbutwl658a.png" alt="Machine-Readable Governance" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
A framework like this needs more than score outputs. It needs structured state that can travel with the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ARCHITECTURE_GAPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C4_sidrce_omega&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIDRCE Omega (v4.5+): chi-squared Gaussian model. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score = exp(-chi2/2), chi2 = (||beta||/tol)^2. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Derived from zero-centered Gaussian likelihood. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GAP: tolerance values are calibration parameters with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no first-principles derivation of their exact magnitude.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C8_rg_linearized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RG flow: linearized dilaton ODE only. Metric does NOT flow. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROXIMATION: valid only for small perturbations &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;around a fixed background geometry.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A machine-readable registry turns caveat prose into runtime surface. That lets review results carry explicit maturity labels — &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;, &lt;code&gt;environment_conditional&lt;/code&gt;, &lt;code&gt;research_only&lt;/code&gt; — instead of leaving those distinctions buried in documentation. Without that surface, approximation and closure collapse into the same sentence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scoring Policy Is Explicit
&lt;/h2&gt;

&lt;p&gt;SPAR keeps score policy visible. No hidden learned weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SCORE_TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANOMALY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# contradicts an analytical anchor
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# review-layer failure
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GAP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# honest gap disclosure
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# bounded concern
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROXIMATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# known simplification
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;journal_verdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer_a&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;count_anomalies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MINOR REVISION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAJOR REVISION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are not laws of nature. They are review policy. A hidden learned scorer may feel sophisticated. An explicit policy is easier to inspect, debate, and change.&lt;/p&gt;

&lt;p&gt;Two or more Layer A anomalies trigger unconditional REJECT regardless of total score. Mathematical contract failures are not averaged away by cleaner signals elsewhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Concrete Example: The Omega Score Transition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh145cyc50hcjmaxkk4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh145cyc50hcjmaxkk4c.png" alt="The Omega Score Transition" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flamehaven-TOE's primary governance metric, SIDRCE Omega, once relied on a large arbitrary multiplicative constant applied to the raw residual. The outputs looked stable. Nothing felt obviously broken.&lt;/p&gt;

&lt;p&gt;SPAR still flagged it. Stability was not the right question. The stronger question was whether the formula justified the interpretation being attached to it. A free scaling constant with no physical derivation is not the same thing as a physically motivated model.&lt;/p&gt;

&lt;p&gt;The formula was replaced with a chi-squared Gaussian construction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gs_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_G_norm&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_G&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b_score&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_B_norm&lt;/span&gt;  &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;si_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_Phi_norm&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_Phi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;omega_physics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;si_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That change matters because it introduces a reversible relation to the underlying residual:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;||beta|| = tol * sqrt(-2 * ln(score))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given a reported score, the residual is recoverable. The score is no longer just a presentation layer. It encodes a falsifiable relation to the quantity beneath it.&lt;/p&gt;
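&lt;p&gt;The round trip can be checked numerically. A minimal sketch — the function names and the tolerance value are chosen for illustration, not taken from the repo:&lt;/p&gt;

```python
import math

def gaussian_score(residual_norm, tol):
    # Per-component chi-squared Gaussian score, as in the construction above.
    return math.exp(-0.5 * (residual_norm / tol) ** 2)

def recover_residual(score, tol):
    # Invert the score back to the residual norm.
    return tol * math.sqrt(-2.0 * math.log(score))

tol = 0.05
residual = 0.03
score = gaussian_score(residual, tol)
recovered = recover_residual(score, tol)
# recovered matches the original residual up to float error
print(recovered)
```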

&lt;p&gt;SPAR did not respond by declaring the problem solved. It updated the classification precisely: the formula ceased to be arbitrary, but the remaining gap shifted to the tolerance scales, which are still calibration parameters. That is a narrower and more honest claim than either "arbitrary" or "fully resolved."&lt;/p&gt;

&lt;p&gt;That is exactly the kind of distinction SPAR is built to review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; .[dev]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spar_framework.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_review&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spar_domain_physics.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_review_runtime&lt;/span&gt;

&lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_review_runtime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_G_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_B_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_Phi_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sidrce_omega&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eft_m_kk_gev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0e16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ricci_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flat minkowski&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;report_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bounded report text.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_registry_snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The review result carries more than pass/fail: a verdict, a score, and a maturity-aware review surface. That surface is what makes the output governable rather than just evaluable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Framework Fits
&lt;/h2&gt;

&lt;p&gt;Physics remains the first adapter and strongest early testbed. The review pattern is broader.&lt;/p&gt;

&lt;p&gt;It fits anywhere outputs can pass while the attached claim can drift: scientific computing pipelines, PDE and simulation workflows, scientific ML surrogates, inverse and calibration models, AI code review, model governance, regulated analytics and reporting.&lt;/p&gt;

&lt;p&gt;That does &lt;strong&gt;not&lt;/strong&gt; mean every team needs the full framework. Often the first useful step is smaller.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Lightweight Adoption Path
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkia6qy78ou1b90gohdut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkia6qy78ou1b90gohdut.png" alt="Pragmatic Adoption Path" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Claim Check.&lt;/strong&gt; Add three explicit questions to an existing workflow: What is the output actually claiming? Does that claim match the implementation state? Is this result exact, approximate, partial, or heuristic? Most teams can do this immediately with no new tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Maturity Labels.&lt;/strong&gt; Attach state labels to results: &lt;code&gt;heuristic&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;, &lt;code&gt;environment_conditional&lt;/code&gt;. A small registry. Already a meaningful step beyond ordinary review.&lt;/p&gt;
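&lt;p&gt;A Level 2 registry really can be that small. A minimal sketch, assuming nothing more than a dict plus a label check — the names and shape are illustrative, not SPAR's API:&lt;/p&gt;

```python
# Illustrative maturity-label registry; not SPAR's actual API.
MATURITY_LABELS = {"heuristic", "partial", "closed", "environment_conditional"}

registry = {}

def record_result(name, value, maturity):
    # Refuse labels outside the agreed vocabulary so drift stays visible.
    if maturity not in MATURITY_LABELS:
        raise ValueError(f"unknown maturity label: {maturity}")
    registry[name] = {"value": value, "maturity": maturity}

record_result("omega_physics", 0.97, "partial")
print(registry["omega_physics"]["maturity"])  # partial
```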

&lt;p&gt;&lt;strong&gt;Level 3 — Full SPAR.&lt;/strong&gt; Layer A anchor consistency, Layer B interpretation validity, Layer C existence and maturity probes, registry-backed snapshots, explicit score and verdict policy.&lt;/p&gt;

&lt;p&gt;SPAR can be used as a review habit before it is adopted as a full framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  What SPAR Does Not Do
&lt;/h2&gt;

&lt;p&gt;SPAR does not provide a universal truth engine, free-form LLM judging in the core, domain contracts inside the generic kernel, or certainty about whether a scientific claim is true in all possible senses.&lt;/p&gt;

&lt;p&gt;SPAR is not a machine for declaring truth. Its narrower goal is to make &lt;strong&gt;claim drift reviewable&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reliability Is Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk32aoci0joww2uhbcqyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk32aoci0joww2uhbcqyz.png" alt="Make Claim Drift Reviewable" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reliability asks whether a system produces stable, repeatable outputs. Admissibility asks whether those outputs deserve the meanings attached to them.&lt;/p&gt;

&lt;p&gt;A stub that always returns zero can be reliable. A heuristic threshold can be reliable. A smoothly calibrated score can be reliable. None of those facts alone makes the resulting claim justified.&lt;/p&gt;

&lt;p&gt;Current AI and scientific tooling is already better at measuring reliability than admissibility. That asymmetry is understandable — reliability is easier to benchmark, easier to automate, easier to ship in CI. But admissibility is where silent approximations, overstated claims, and maturity mismatches accumulate.&lt;/p&gt;

&lt;p&gt;SPAR is one working answer to that problem. Not a universal answer. A technical one.&lt;/p&gt;

&lt;p&gt;It turns implementation state, maturity state, analytical anchoring, and scope honesty into review objects that can travel with the result.&lt;/p&gt;

&lt;p&gt;That is why the architecture may matter outside the domain that produced it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/SPAR-Framework" rel="noopener noreferrer"&gt;github.com/flamehaven01/SPAR-Framework&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/WHAT_IS_SPAR.md" rel="noopener noreferrer"&gt;What Is SPAR&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/ADMISSIBILITY.md" rel="noopener noreferrer"&gt;Admissibility&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/PHYSICS_PROOF_CASE.md" rel="noopener noreferrer"&gt;Physics as the Proof Case&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/USE_CASES.md" rel="noopener noreferrer"&gt;Use Cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>governance</category>
      <category>ai</category>
      <category>verification</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>AI SLOP Detector v3.1: Three Formula Refinements and the Adversarial Tester That Found Them</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:29:57 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n</guid>
      <description>&lt;p&gt;We shipped v2.9.0 with a scoring engine we trusted. We ran tests. Everything passed.&lt;/p&gt;

&lt;p&gt;Then we built a tool specifically designed to find cases where the score was &lt;em&gt;less precise than it could be&lt;/em&gt; — and it found three.&lt;/p&gt;

&lt;p&gt;This is the story of v3.1.0. And the patch that followed six hours later.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmla75gj6xobi0was3209.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmla75gj6xobi0was3209.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Glossary — internal terminology used throughout this post
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deficit score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The final output of the scorer. 0 = structurally clean, 100 = critical. Derived as &lt;code&gt;100 × (1 - GQG)&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GQG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Geometric Quality Gate. A weighted geometric mean of LDR, Inflation quality, DDC, and Purity. The single formula the scorer evaluates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logic Density Ratio. Ratio of executable logic lines to total lines. Low LDR = file is mostly stubs, blanks, or comments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inflation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric that flags jargon-heavy docstrings unsupported by actual code complexity. A 2-line function with a 30-line docstring using 12 buzzwords scores badly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dead/Duplicate Code ratio. Tracks unreachable paths, copy-pasted blocks, phantom imports.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern hit rate. How many structural anti-patterns (god functions, stub returns, nested complexity) fire on the file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cyclomatic Complexity (CC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Count of independent code paths. A straight-line function = CC 1. Each &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;for&lt;/code&gt;, &lt;code&gt;while&lt;/code&gt;, &lt;code&gt;except&lt;/code&gt; adds 1.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;fhval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;flamehaven-validator. An external tool that interrogates the scorer from outside the codebase. Its purpose is to catch cases where internal test consistency masquerades as correctness.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPAR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subcommand of fhval. Adversarial regression loop with three layers. Tests whether the scorer measures what it claims to measure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jensen-Shannon Divergence. A symmetric, bounded (0–1) measure of divergence between two probability distributions. Used here to compare AST node-type histograms between functions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree. The parsed structure of source code. An &lt;code&gt;if&lt;/code&gt; statement, a &lt;code&gt;return&lt;/code&gt;, a function call each become typed nodes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;function_clone_cluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New pattern in v3.1.0. Detects files where many functions share near-identical AST structure — the fragmented god function evasion pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;placeholder_variable_naming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New pattern in v3.1.0. Detects vocabulary-clean code with zero semantic content: single-letter parameter floods, sequential numbered variables.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AM/GM gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core refinement in v3.1.0. The calibrator used an arithmetic mean (simpler approximation); the scorer uses a geometric mean (the precise target formula). Aligning them closes a ~5-7pt estimation gap on uneven files.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
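&lt;p&gt;To make the JSD entry above concrete, the divergence between two node-type histograms can be computed directly. A rough sketch — the node names and histograms are invented for the example:&lt;/p&gt;

```python
import math
from collections import Counter

def jsd(p_counts, q_counts):
    # Jensen-Shannon divergence (base 2, so bounded in [0, 1])
    # between two count histograms.
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values())
    q_total = sum(q_counts.values())
    total = 0.0
    for k in keys:
        p = p_counts.get(k, 0) / p_total
        q = q_counts.get(k, 0) / q_total
        m = 0.5 * (p + q)
        if p:
            total += 0.5 * p * math.log2(p / m)
        if q:
            total += 0.5 * q * math.log2(q / m)
    return total

a = Counter({"If": 2, "Return": 2, "Call": 4})
b = Counter({"If": 2, "Return": 2, "Call": 4})
c = Counter({"For": 5, "Assign": 3})
print(jsd(a, b))  # identical structure: 0.0
print(jsd(a, c))  # disjoint structure: 1.0
```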




&lt;h2&gt;
  
  
  Quick context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI SLOP Detector&lt;/a&gt; is a static analyzer that measures structural code quality — not style, not formatting. It scores each file across four dimensions and assigns a &lt;code&gt;deficit&lt;/code&gt; between 0 (clean) and 100 (critical):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ratio of executable logic to total lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inflation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jargon, docstring bloat, unsupported claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unreachable paths, copy-pasted blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern hit rate (stubs, god functions, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These four numbers feed a single formula — a weighted geometric mean — called the &lt;strong&gt;GQG&lt;/strong&gt;. The output is the deficit score: &lt;code&gt;100 × (1 - GQG)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The calibrator's job is to find the best weights for that formula by searching over thousands of known cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before v3.1.0: the self-scan
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrv0jxl1j4mnk2q1bnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrv0jxl1j4mnk2q1bnk.png" alt="3critical blind spots discovered" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We don't ship a version without running the detector against itself. Before cutting v3.1.0, we ran v3.0.3 — a structural debt reduction pass on the three highest-deficit files in the codebase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Self-scan before: avg_deficit=23.57, 15 deficit files, status=suspicious
Self-scan after:  avg_deficit=20.33, 12 deficit files, status=clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;analysis/cross_file.py&lt;/code&gt; dropped from 70.3 to 28.7 (critical → clean). &lt;code&gt;ci_gate.py&lt;/code&gt; from 69.3 to 22.3. &lt;code&gt;cli.py&lt;/code&gt; from 68.4 to 20.9. The fixes were mechanical: extracted nested closures to private methods, replaced &lt;code&gt;if/elif/else&lt;/code&gt; dispatch chains with dict dispatch, removed re-declared constants.&lt;/p&gt;

&lt;p&gt;The point is not that these numbers are good. It's that the tool had to earn its own PASS before we shipped the version that refines the formula. Shipping a scoring engine while your own codebase sits at &lt;code&gt;suspicious&lt;/code&gt; would have been its own kind of slop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The adversarial tester: fhval SPAR
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://dev.to/flamehaven01/i-built-an-ecosystem-of-46-ai-assisted-repos-then-i-realized-it-might-be-eating-itself-46ni"&gt;previous post&lt;/a&gt; we described &lt;code&gt;fhval&lt;/code&gt; — flamehaven-validator. The core concern: when every tool in an ecosystem is built by the same person against the same baseline, internal consistency can masquerade as correctness. Passing your own tests proves nothing about whether your tests are asking the right questions.&lt;/p&gt;

&lt;p&gt;For v3.1.0 we added a &lt;code&gt;spar&lt;/code&gt; subcommand — an adversarial regression loop that interrogates the scorer from the outside. Running SPAR against the v3.0.x scorer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPAR score: 55 / 100  [FAIL]

Layer A anomalies:
  A3 stub_class_8_methods     expected &amp;gt;= 30  got 20.0  [ANOMALY]
  A4 fragmented_god_function  expected &amp;gt;= 10  got  0.0  [ANOMALY]
  A5 vocab_clean_meaningless  expected &amp;gt;=  8  got  0.0  [ANOMALY]

Layer C blind spots:
  C2 inflation_blindspot      [BLIND_SPOT]
  C3 ddc_annotation_gap       [BLIND_SPOT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three gaps. Two documented scope limits. Score: 55 FAIL.&lt;/p&gt;

&lt;p&gt;Each gap pointed at a specific detection weakness. The SPAR methodology itself — how Layer A/B/C work, why adversarial ground truth is hard to author from inside the codebase — is a separate topic covered in tomorrow's post. Here we focus on what the gaps told us and what we changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refinement 1: The calibrator and scorer were using different formulas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ye6lvz5fzouhj62ksxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ye6lvz5fzouhj62ksxv.png" alt="the geometric mean" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scorer computes a &lt;strong&gt;weighted geometric mean&lt;/strong&gt;. The calibrator — which finds optimal weights — was computing a &lt;strong&gt;weighted arithmetic mean&lt;/strong&gt; as its optimization target.&lt;/p&gt;

&lt;p&gt;Those are not the same thing, and for a quality gate, the difference is structural.&lt;/p&gt;

&lt;p&gt;Consider a file with three dimension scores: LDR=0.9 (good), inflation_quality=0.1 (very bad), DDC=0.8 (good).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Formula&lt;/th&gt;
&lt;th&gt;Calculation&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Deficit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Arithmetic mean&lt;/td&gt;
&lt;td&gt;(0.9 + 0.1 + 0.8) / 3&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geometric mean&lt;/td&gt;
&lt;td&gt;(0.9 × 0.1 × 0.8) ^ (1/3)&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The arithmetic mean gives deficit=40. The geometric mean gives deficit=58. The gap is 18 points — not rounding, but structural. The geometric mean &lt;strong&gt;amplifies weak dimensions&lt;/strong&gt; because one bad score pulls the entire product down. The arithmetic mean averages over them.&lt;/p&gt;

&lt;p&gt;The scorer uses the geometric mean for good reason: a file with excellent LDR but zero actual logic (all docstrings) should not score deficit=30. It should score much higher. The formula enforces that.&lt;/p&gt;

&lt;p&gt;The first-generation calibrator used an arithmetic mean as a simpler starting approximation. So it was finding weights that minimized error against a different objective than the one the scorer actually computes. The result: roughly 5–7 point underestimation on files with uneven dimension profiles — which are precisely the target of this tool.&lt;/p&gt;

&lt;p&gt;The AM ≥ GM inequality means the calibrator's scores were always optimistic. For balanced files (all dimensions similar) the gap is small and harmless. For uneven files, it was systematic — and those are the cases that matter most.&lt;/p&gt;
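&lt;p&gt;The table's numbers can be reproduced in a few lines, using the unweighted case with the dimension values from the example above:&lt;/p&gt;

```python
import math

# LDR, inflation_quality, DDC from the example above
scores = [0.9, 0.1, 0.8]

am = sum(scores) / len(scores)               # arithmetic mean
gm = math.prod(scores) ** (1 / len(scores))  # geometric mean

print(round(100 * (1 - am)))  # deficit 40
print(round(100 * (1 - gm)))  # deficit 58
```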

&lt;p&gt;&lt;strong&gt;Refinement:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (calibrator _recompute_deficit)
&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_ldr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ldr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_inflation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inflation_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_ddc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ddc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt;

&lt;span class="c1"&gt;# After — mirrors the scorer's GQG formula exactly
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="n"&gt;gqg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;w_ldr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ldr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_inflation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inflation_n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_ddc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ddc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deficit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gqg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why SPAR anomaly A3 (&lt;code&gt;stub_class_8_methods&lt;/code&gt;) jumped from deficit 20.0 to 40.0: the stub class had heavily uneven dimensions, and the geometric mean scored it correctly once the calibrator was trained against the right target.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refinement 2: The complexity modifier had a dead zone at the common end
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyfwsfopk911dg0uec96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyfwsfopk911dg0uec96.png" alt="docstring bloat" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The inflation metric applies a complexity modifier to penalize functions that are simultaneously simple and jargon-heavy — a common pattern in AI-generated code: a two-line function surrounded by an elaborate docstring.&lt;/p&gt;

&lt;p&gt;The first-generation modifier formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;complexity_modifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_complexity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For CC=1: &lt;code&gt;1.0 + (1-3)/10 = 0.8&lt;/code&gt; → &lt;code&gt;max(1.0, 0.8)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;br&gt;
For CC=2: &lt;code&gt;1.0 + (2-3)/10 = 0.9&lt;/code&gt; → &lt;code&gt;max(1.0, 0.9)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;br&gt;
For CC=3: &lt;code&gt;1.0 + (3-3)/10 = 1.0&lt;/code&gt; → &lt;code&gt;max(1.0, 1.0)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CC=1, 2, and 3 all received the same modifier: 1.0. Simple functions at the three most common complexity levels therefore paid no complexity premium on inflation, regardless of how jargon-heavy they were. The modifier only activated from CC=4 upward.&lt;/p&gt;

&lt;p&gt;Simple jargon-heavy functions are the most common AI code signature. The formula was least sensitive precisely where it needed to be most sensitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — CC=1 is the baseline, not CC=3
&lt;/span&gt;&lt;span class="n"&gt;complexity_modifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_complexity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now CC=2 gets a 1.10× modifier, CC=3 gets 1.20×. The penalty scales from the simplest meaningful function upward.&lt;/p&gt;
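A standalone recomputation of both formulas makes the dead zone visible (this simply re-evaluates the two expressions shown above):

```python
def modifier_before(cc):
    # old formula: CC=3 baseline produces a dead zone at CC 1-3
    return max(1.0, 1.0 + (cc - 3.0) / 10.0)

def modifier_after(cc):
    # new formula: CC=1 is the baseline
    return max(1.0, 1.0 + (cc - 1.0) / 10.0)

for cc in range(1, 6):
    print(cc, modifier_before(cc), modifier_after(cc))
```

The old formula returns 1.0 for CC=1 through CC=3; the new one separates them from CC=2 onward.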




&lt;h2&gt;
  
  
  Refinement 3: Purity weight was documented but not connected
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1d144fslfbtceumfn9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1d144fslfbtceumfn9o.png" alt="catching stub piplelines and placeholder variable floods" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The GQG formula includes a purity dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;  &lt;span class="c1"&gt;# hardcoded constant
&lt;/span&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gqg_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;purity_penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.slopconfig.yaml&lt;/code&gt; had a &lt;code&gt;weights.purity&lt;/code&gt; field. The calibrator's weight search had a purity parameter. Neither was connected to this constant — users could configure &lt;code&gt;weights.purity: 0.20&lt;/code&gt; and nothing would change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# default unchanged; now configurable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. The config surface now matches the implementation.&lt;/p&gt;
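A minimal sketch of what that one line enables, where the weights dict stands in for the parsed .slopconfig.yaml weights section (an illustration, not the tool's code):

```python
def final_score(gqg_score, purity_penalty, weights):
    # "purity" now read from config, with the old constant as the default
    w_pur = weights.get("purity", 0.10)
    return gqg_score * (1.0 - w_pur * purity_penalty)

default = final_score(80.0, 0.5, {})                 # uses w_pur = 0.10
stricter = final_score(80.0, 0.5, {"purity": 0.20})  # user-configured weight
print(default, stricter)
```

Before the fix, both calls would have produced the first value; the configured weight was silently ignored.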




&lt;h2&gt;
  
  
  Two new detection patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stub evasion: empty container returns
&lt;/h3&gt;

&lt;p&gt;The existing &lt;code&gt;return_constant_stub&lt;/code&gt; pattern caught &lt;code&gt;return True&lt;/code&gt;, &lt;code&gt;return 0&lt;/code&gt;, &lt;code&gt;return "string"&lt;/code&gt; — but not &lt;code&gt;return {}&lt;/code&gt;, &lt;code&gt;return []&lt;/code&gt;, &lt;code&gt;return ()&lt;/code&gt;, &lt;code&gt;return set()&lt;/code&gt;. These are equally common stub patterns in class skeletons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# was not flagged before
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# was not flagged before
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are now caught by &lt;code&gt;return_constant_stub&lt;/code&gt; and &lt;code&gt;interface_only_class&lt;/code&gt;.&lt;/p&gt;
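The kind of AST check involved can be sketched as follows — a standalone approximation under the behavior described above, not the tool's actual implementation:

```python
import ast

def is_empty_container_stub(fn):
    """True if fn's body (ignoring a docstring) is a single `return` of an
    empty dict/list/tuple literal or a zero-argument set()/dict()/list()/tuple()."""
    body = list(fn.body)
    if (body and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)):
        body = body[1:]  # skip a leading docstring
    if len(body) != 1 or not isinstance(body[0], ast.Return):
        return False
    val = body[0].value
    if isinstance(val, ast.Dict):
        return not val.keys
    if isinstance(val, (ast.List, ast.Tuple, ast.Set)):
        return not val.elts
    return (isinstance(val, ast.Call) and isinstance(val.func, ast.Name)
            and val.func.id in {"set", "dict", "list", "tuple"}
            and not val.args and not val.keywords)

source = '''
class DataProcessor:
    def get_results(self) -> dict:
        return {}
    def real_work(self, xs) -> list:
        return [x * 2 for x in xs]
'''
tree = ast.parse(source)
flags = {fn.name: is_empty_container_stub(fn)
         for fn in ast.walk(tree) if isinstance(fn, ast.FunctionDef)}
print(flags)
```

Here `get_results` is flagged and `real_work` is not, since its single return actually computes something.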




&lt;h3&gt;
  
  
  Fragmented god function: AST clone detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimenh3uabx0lodo1yqs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimenh3uabx0lodo1yqs0.png" alt="AST Jensen-shannon divergence" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPAR anomaly A4 was a file with 12 one-liner helper functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;
&lt;span class="c1"&gt;# ... through r12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function individually looks clean: low complexity, no nesting, short. No single function exceeds any per-function threshold. But collectively, this is a decomposed god function — a large computation split into structurally identical fragments that evade per-function gates.&lt;/p&gt;

&lt;p&gt;The new pattern: &lt;code&gt;function_clone_cluster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; For each file, build a 30-dimensional histogram of AST node types for every function: how many &lt;code&gt;If&lt;/code&gt; nodes, &lt;code&gt;Return&lt;/code&gt; nodes, &lt;code&gt;Call&lt;/code&gt; nodes, &lt;code&gt;BinOp&lt;/code&gt; nodes, and so on. The histogram is normalized to a probability distribution. Then compute the pairwise Jensen-Shannon divergence (JSD) between all function pairs. JSD is bounded between 0 and 1; two functions with near-identical AST structure produce a JSD close to 0.&lt;/p&gt;

&lt;p&gt;Functions with JSD &amp;lt; 0.05 get an edge in a graph. BFS finds connected components. The largest component is the clone cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thresholds:
  &amp;gt;= 6 functions in cluster: CRITICAL
  &amp;gt;= 4 functions in cluster: HIGH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
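The pipeline above — node-type histograms, pairwise JSD, BFS over the sub-threshold graph — can be condensed into a short sketch. Names and details here are illustrative, not the tool's code:

```python
import ast
import math
from collections import Counter, deque
from itertools import combinations

def histogram(fn):
    # normalized distribution of AST node types inside one function
    counts = Counter(type(n).__name__ for n in ast.walk(fn))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def jsd(p, q):
    # Jensen-Shannon divergence (base 2, bounded in [0, 1])
    keys = set(p) | set(q)
    m = {k: (p.get(k, 0.0) + q.get(k, 0.0)) / 2 for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def clone_clusters(tree, threshold=0.05):
    fns = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    hists = {fn.name: histogram(fn) for fn in fns}
    names = list(hists)
    edges = {n: [] for n in names}
    for a, b in combinations(names, 2):
        if threshold > jsd(hists[a], hists[b]):  # near-identical structure
            edges[a].append(b)
            edges[b].append(a)
    clusters, seen = [], set()
    for start in names:  # BFS to find connected components
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            queue.extend(edges[n])
        seen |= comp
        if len(comp) > 1:
            clusters.append(comp)
    return clusters
```

On the fragmented-helper file above, all the one-liners produce identical histograms, so every pairwise JSD is 0 and they collapse into a single cluster.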



&lt;p&gt;&lt;strong&gt;Why JSD and not simpler metrics.&lt;/strong&gt; Cosine similarity or Euclidean distance on raw histograms don't handle sparse distributions well — short functions have mostly empty histograms, and small absolute differences dominate. JSD compares distributions rather than raw vectors, and it remains stable when most histogram dimensions are near zero. It also has an upper bound of 1, which makes the 0.05 threshold interpretable rather than dataset-dependent.&lt;/p&gt;

&lt;p&gt;The JSD threshold (0.05) was calibrated against the internal test corpus. It will produce false positives on files with many similar utility functions — for example, a large set of &lt;code&gt;_validate_field_X()&lt;/code&gt; validators that are structurally identical by design. Adjust via &lt;code&gt;--config&lt;/code&gt; if needed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Placeholder variable naming (v1.0)
&lt;/h3&gt;

&lt;p&gt;SPAR anomaly A5 was vocabulary-clean code with zero semantic content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
    &lt;span class="n"&gt;r3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
    &lt;span class="c1"&gt;# ... through r12
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No buzzwords. No docstring bloat. Every traditional linter passes this. The new &lt;code&gt;placeholder_variable_naming&lt;/code&gt; pattern applies two checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single-letter parameter density&lt;/strong&gt;: 5 or more single-letter parameters (excluding &lt;code&gt;self&lt;/code&gt;, &lt;code&gt;cls&lt;/code&gt;, &lt;code&gt;_&lt;/code&gt;) → HIGH.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential numbered variables&lt;/strong&gt;: a run of 8 or more → HIGH; 4 or more → MEDIUM.&lt;/li&gt;
&lt;/ol&gt;
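A standalone sketch of those two checks, following the thresholds above (an approximation for illustration, not the shipped pattern):

```python
import ast
import re

NUMBERED = re.compile(r"([A-Za-z_]+?)(\d+)")

def placeholder_flags(fn):
    """Apply the two v1.0 checks to one function.
    Returns a list of (check, severity) tuples; thresholds follow the post."""
    flags = []
    # check 1: single-letter parameter density (excluding self, cls, _)
    params = [a.arg for a in fn.args.args if a.arg not in {"self", "cls", "_"}]
    if sum(1 for p in params if len(p) == 1) >= 5:
        flags.append(("single_letter_params", "HIGH"))
    # check 2: longest run of sequentially numbered assigned variables
    by_prefix = {}
    for node in ast.walk(fn):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            m = NUMBERED.fullmatch(node.id)
            if m:
                by_prefix.setdefault(m.group(1), set()).add(int(m.group(2)))
    run = 0
    for nums in by_prefix.values():
        for n in nums:
            if n - 1 not in nums:  # n starts a consecutive run
                length = 1
                while n + length in nums:
                    length += 1
                run = max(run, length)
    if run >= 8:
        flags.append(("sequential_numbered_vars", "HIGH"))
    elif run >= 4:
        flags.append(("sequential_numbered_vars", "MEDIUM"))
    return flags
```

Applied to the `aggregate` example above, both checks fire at HIGH: seven single-letter parameters and a twelve-variable `r1`…`r12` run.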

&lt;p&gt;This is v1.0: it detects naming &lt;em&gt;style&lt;/em&gt;, not semantic quality. Known false positive zone: scientific and math libraries legitimately use single-letter conventions (&lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;, &lt;code&gt;mu&lt;/code&gt;, &lt;code&gt;sigma&lt;/code&gt;). Suppress with &lt;code&gt;domain_overrides&lt;/code&gt; in &lt;code&gt;.slopconfig.yaml&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  SPAR result after v3.1.0
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPAR score: 85 / 100  [PASS]

Layer A: 5/5 anchors consistent
Layer B: 4 documented limitations (no regressions)
Layer C: C2 inflation_blindspot [BLIND_SPOT — known scope limit]
         C3 ddc_annotation_gap  [BLIND_SPOT — known scope limit]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;55 → 85 PASS.&lt;/p&gt;

&lt;p&gt;The two remaining blind spots are not gaps to close — they're the documented scope limits of static analysis: a tool that reads AST cannot determine whether arithmetic is semantically meaningful, or whether annotation-heavy imports serve a real runtime purpose. Those require a different class of model. Documenting the ceiling is part of the job.&lt;/p&gt;

&lt;p&gt;The full SPAR methodology — how Layer A/B/C work, why Layer A ground truth is hard to author from inside the codebase, and what "validating the validator" means in practice — is covered in tomorrow's post.&lt;/p&gt;




&lt;h2&gt;
  
  
  v3.1.1: the self-inspection patch
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5ynlx4hh85jso5rqdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5ynlx4hh85jso5rqdq.png" alt="dog food" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v3.1.0 and v3.1.1 shipped on the same day. The clone detection pattern introduced in v3.1.0 had a visibility gap: &lt;code&gt;function_clone_cluster&lt;/code&gt; fired in the Issues section but produced no signal in the Core Metrics table. A community issue caught it within hours.&lt;/p&gt;

&lt;p&gt;But before cutting v3.1.1, we ran the tool against itself — and the new patterns found something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;placeholder.py    deficit: 70.3  [CRITICAL_DEFICIT]
python_advanced.py  deficit: 74.0  [CRITICAL_DEFICIT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both files are part of the detection engine itself. Root cause: &lt;code&gt;check_node&lt;/code&gt; methods with cyclomatic complexity 20–31, caused by compound boolean logic that had accumulated across releases. The tool was flagging its own pattern implementations as having the exact complexity problems it was designed to detect.&lt;/p&gt;

&lt;p&gt;We extracted four module-level helpers in &lt;code&gt;placeholder.py&lt;/code&gt; (&lt;code&gt;_strip_docstring&lt;/code&gt;, &lt;code&gt;_has_abstractmethod&lt;/code&gt;, &lt;code&gt;_empty_container_repr&lt;/code&gt;, &lt;code&gt;_is_placeholder_stmt&lt;/code&gt;) and added &lt;code&gt;_make_god_issue()&lt;/code&gt; and &lt;code&gt;_collect_numbered_vars()&lt;/code&gt; to &lt;code&gt;python_advanced.py&lt;/code&gt;. Each &lt;code&gt;check_node&lt;/code&gt; method went from 20–70 lines to 8–15. The detector earned its own PASS before shipping the patch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;placeholder.py      70.3 → 43.7  [CRITICAL → SUSPICIOUS]
python_advanced.py  74.0 → 66.7  [CRITICAL → INFLATED_SIGNAL]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional v3.1.1 refinements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clone Detection row&lt;/strong&gt; added to Core Metrics table (CRITICAL/PASS at a glance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table style unified&lt;/strong&gt; to &lt;code&gt;box.ROUNDED&lt;/code&gt; across all project output (was mixing three styles).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VS Code extension&lt;/strong&gt;: &lt;code&gt;extractJson()&lt;/code&gt; strips &lt;code&gt;[INFO]&lt;/code&gt; log lines before &lt;code&gt;JSON.parse&lt;/code&gt; — previously caused silent parse failures when CLI log output appeared alongside JSON. Workspace analysis replaced with a QuickPick list of deficit files sorted by score; clicking opens the file in the editor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you installed 3.1.0, upgrade to 3.1.1 before using clone detection in CI.&lt;/p&gt;




&lt;h2&gt;
  
  
  How this fits alongside existing tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl4ag50tf34mrhioj0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl4ag50tf34mrhioj0q.png" alt="compare" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it sees that others don't&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semgrep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern-matching on AST&lt;/td&gt;
&lt;td&gt;Rule violations you've pre-authored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SonarQube&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cognitive complexity, duplication, coverage&lt;/td&gt;
&lt;td&gt;Complexity, coverage gaps — not structural emptiness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Radon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cyclomatic complexity&lt;/td&gt;
&lt;td&gt;Raw CC values; used internally by AI SLOP Detector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bandit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security rules&lt;/td&gt;
&lt;td&gt;Security vulnerabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mutmut / cosmic-ray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mutation testing&lt;/td&gt;
&lt;td&gt;Whether your test suite catches real bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI SLOP Detector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric-based structural analysis&lt;/td&gt;
&lt;td&gt;Docstring theater, stub pipelines, fragmented logic, phantom imports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key gap: a file can be fully SonarQube-clean while containing zero actual logic — all stubs, all docstrings, all type annotations. Cognitive complexity doesn't measure whether the complexity is real. LDR does. Inflation does.&lt;/p&gt;

&lt;p&gt;The complementary tool here is mutation testing. SPAR tests whether the scorer measures what it claims. Mutation testing tests whether your tests catch what they claim to catch. Both are adversarial approaches to the meta-problem: how do you validate the validator?&lt;/p&gt;




&lt;h2&gt;
  
  
  Score evolution
&lt;/h2&gt;

&lt;p&gt;If you're running AI SLOP Detector on an existing project, upgrading to 3.1.x will change your scores. The formula alignment in Refinement 1 increases deficit on files with uneven dimension profiles, typically by 3–8 points. This is not drift — it's the scorer becoming more precise in the region where it matters most. Files that were borderline &lt;code&gt;suspicious&lt;/code&gt; may move into &lt;code&gt;inflated_signal&lt;/code&gt;. Check your CI threshold after upgrading.&lt;/p&gt;

&lt;p&gt;Previous scores were valid estimates produced by the first-generation model. v3.1.x scores are tighter estimates with better sensitivity where dimensions are uneven — which is precisely the profile of AI-generated code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;function_clone_cluster&lt;/code&gt; threshold (JSD &amp;lt; 0.05) was calibrated against the internal test corpus. It will fire false positives on legitimate utility function clusters. Adjust via &lt;code&gt;--config&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;placeholder_variable_naming&lt;/code&gt; v1.0 has no semantic context. &lt;code&gt;def distance(x, y, z)&lt;/code&gt; is legitimate; the pattern doesn't know that.&lt;/li&gt;
&lt;li&gt;SPAR score 85 means five ground truth anchors pass and eight of ten Layer C probes hold. The space of evasion patterns is open-ended. More in tomorrow's SPAR post.&lt;/li&gt;
&lt;li&gt;The Layer A corpus is internally authored. External adversarial contributions would make it stronger.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Install / upgrade
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector&lt;span class="o"&gt;==&lt;/span&gt;3.1.1
&lt;span class="c"&gt;# or&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; ai-slop-detector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VS Code extension: search &lt;strong&gt;"AI SLOP Detector"&lt;/strong&gt; in Extensions, or install from VSIX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code &lt;span class="nt"&gt;--install-extension&lt;/span&gt; vscode-slop-detector-3.1.1.vsix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scan a project&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; ./your-project

&lt;span class="c"&gt;# Machine-readable output&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; ./your-project &lt;span class="nt"&gt;--json&lt;/span&gt; | jq &lt;span class="s1"&gt;'.file_results[] | {file: .file_path, deficit: .deficit_score}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;GitHub: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previous posts in this series:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/the-tool-that-turned-on-itself-ai-slop-detector-v290-v291-3oc4"&gt;v2.9.0/v2.9.1: The Tool That Turned On Itself&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v270-why-we-built-a-linter-we-actually-use-2nb6"&gt;v2.7.0: Why We Built a Linter We Actually Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v263-is-live-on-vs-code-3oj4"&gt;v2.6.3: Now on VS Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/i-built-an-ecosystem-of-46-ai-assisted-repos-then-i-realized-it-might-be-eating-itself-46ni"&gt;fhval: I Built an Ecosystem of 46 AI-Assisted Repos. Then I Realized It Might Be Eating Itself.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>architecture</category>
      <category>devtool</category>
    </item>
    <item>
      <title>My AI Maintainer Kept Making Wrong Calls. So I Made It Report Its State Before Touching Anything.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:24:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7</link>
      <guid>https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7</guid>
      <description>&lt;h2&gt;
  
  
  🔎 Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;memory_injection&lt;/strong&gt;: A MICA operational mode. The archive is updated after each maintenance session and read by the next AI session to compensate for session amnesia.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening output the model must produce at session start — declaring archive version, self-test result, drift status, and active invariants — before any repository-level work begins.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state: file existence, hash integrity, and invocation protocol presence.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Drift Response Policy&lt;/strong&gt;: The schema-level declaration of how hash mismatches and missing files are handled. Different failure classes carry different response actions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant&lt;/strong&gt;: A structured governance rule with identity, severity, and statement. Not a style guideline. A constraint the model cannot rationalize past.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Provenance Registry&lt;/strong&gt;: The record of tracked files with SHA256 hashes. The basis for drift detection.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Deviation Log&lt;/strong&gt;: The audit trail of formal exceptions to design invariants. Empty means no exceptions have been logged — not that no judgment calls were made.&lt;/p&gt;
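The Provenance Registry and Self-Test Policy entries describe a hash-based integrity check. A minimal sketch of that mechanism, with a hypothetical registry layout rather than MICA's actual schema:

```python
import hashlib
from pathlib import Path

def sha256(path):
    # hash the file contents, as a provenance registry would record them
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def check_drift(registry):
    """Compare each tracked file against its recorded SHA256.
    registry maps path -> expected hash; the two failure classes
    (missing file vs hash mismatch) are reported separately, since
    a Drift Response Policy may treat them differently."""
    report = {}
    for path, expected in registry.items():
        if not Path(path).exists():
            report[path] = "MISSING"
        elif sha256(path) != expected:
            report[path] = "HASH_MISMATCH"
        else:
            report[path] = "OK"
    return report
```

A session opener would run this against the archive's registry and surface any non-OK entry in its drift status line.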

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d3fkvj8ciwgp3p6u1eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d3fkvj8ciwgp3p6u1eu.png" alt="coverimage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What Part 5 Left Open
&lt;/h2&gt;

&lt;p&gt;Part 5 placed MICA inside the context engineering landscape and drew one boundary: MICA is not a collection system. It begins after collection ends. Its job is to govern what enters the session, what remains authoritative, and how the model proves it actually loaded the governed archive at all.&lt;/p&gt;

&lt;p&gt;That answer was correct. It was also still abstract.&lt;/p&gt;

&lt;p&gt;This post comes down from that framing. It shows what MICA looks like when it is actually running — not as a concept, but as a protocol inside a real project.&lt;/p&gt;

&lt;p&gt;The project is &lt;code&gt;flamehaven.space&lt;/code&gt;, a Next.js B2B site maintained by a solo operator using a MICA package in &lt;code&gt;memory_injection&lt;/code&gt; mode. Everything shown here is drawn from the live archive. Values that would expose internal configuration are anonymized; structure and behavior are real.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Session Opening Report
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekhdvby8iwqn2iby9j2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekhdvby8iwqn2iby9j2u.png" alt="The paradigm shift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every MICA session in &lt;code&gt;memory_injection&lt;/code&gt; mode begins with a declared output before any work starts. The archive specifies the required format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: flamehaven-space-maintainer v0.2.0
Self-test: PASS (3 checks -- ST-001, ST-002, ST-003)
Drift: no hash mismatch detected
Active invariants: DI-001 (critical), DI-002 (critical) + 24 others loaded
Gate: PASS -- proceeding with maintenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a courtesy summary. It is a gate. The archive field &lt;code&gt;session_report_format.gate_block_on&lt;/code&gt; is set to &lt;code&gt;critical_self_test_failure&lt;/code&gt; — meaning the model must declare its load state before it is permitted to make any repository-level changes.&lt;/p&gt;

&lt;p&gt;The format is specified in the archive itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json
"session_report_format": {
  "trigger": "session_start",
  "required_fields": ["archive_version", "self_test", "drift_status", "active_invariants", "gate"],
  "format_template": "[SESSION READY]\nArchive: {archive_version}\nSelf-test: {self_test}\nDrift: {drift_status}\nActive invariants: {active_invariants}\nGate: {gate}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model does not decide what to declare. The archive tells it what a valid session opening looks like.&lt;/p&gt;
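&lt;p&gt;The template-plus-required-fields contract can be sketched in a few lines. This is an illustrative rendering of the archive's &lt;code&gt;session_report_format&lt;/code&gt;, not the MICA implementation; the function name and the refuse-on-missing-field behavior are assumptions.&lt;/p&gt;

```python
# Hypothetical sketch: fill the archive's format_template with the
# session's declared state, refusing to emit a partial report.
TEMPLATE = (
    "[SESSION READY]\n"
    "Archive: {archive_version}\n"
    "Self-test: {self_test}\n"
    "Drift: {drift_status}\n"
    "Active invariants: {active_invariants}\n"
    "Gate: {gate}"
)

REQUIRED_FIELDS = [
    "archive_version", "self_test", "drift_status",
    "active_invariants", "gate",
]

def render_session_report(state):
    # A missing required field is itself a gate failure: better to fail
    # loudly than to emit a plausible-looking but incomplete declaration.
    missing = [f for f in REQUIRED_FIELDS if f not in state]
    if missing:
        raise ValueError(f"cannot open session, missing fields: {missing}")
    return TEMPLATE.format(**state)

report = render_session_report({
    "archive_version": "flamehaven-space-maintainer v0.2.0",
    "self_test": "PASS (3 checks)",
    "drift_status": "no hash mismatch detected",
    "active_invariants": "DI-001 (critical), DI-002 (critical)",
    "gate": "PASS",
})
print(report.splitlines()[0])  # [SESSION READY]
```

&lt;p&gt;The point of the sketch is the contract, not the code: the model fills slots, it does not invent the shape.&lt;/p&gt;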




&lt;h2&gt;
  
  
  3. What the Self-Test Actually Checks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4lxbrkjs3mhf8o3vbrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4lxbrkjs3mhf8o3vbrh.png" alt="self-test mechanics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;self_test_policy&lt;/code&gt; runs on &lt;code&gt;session_start&lt;/code&gt; and &lt;code&gt;pre_handoff&lt;/code&gt;. Three checks matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ST-001&lt;/strong&gt; (&lt;code&gt;provenance_sha256_format&lt;/code&gt;, severity: &lt;code&gt;error&lt;/code&gt;) — verifies that provenance hashes in the registry match the expected format. A malformed hash means the file fingerprint is untrustworthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ST-002&lt;/strong&gt; (&lt;code&gt;provenance_file_exists&lt;/code&gt;, severity: &lt;code&gt;warning&lt;/code&gt;) — verifies that files listed in the provenance registry actually exist on disk. A missing file is not a formatting error; it is a ghost reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ST-003&lt;/strong&gt; (&lt;code&gt;invocation_pattern_present&lt;/code&gt;, severity: &lt;code&gt;error&lt;/code&gt;) — verifies that the invocation protocol is declared and readable. If the model cannot confirm how it was loaded, it cannot confirm the session is in a governed state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failure behavior is set per-check. The overall &lt;code&gt;on_failure&lt;/code&gt; policy for this archive is &lt;code&gt;warn_continue&lt;/code&gt; — the session proceeds, but the failure is surfaced explicitly in the opening report.&lt;/p&gt;

&lt;p&gt;This is a deliberate calibration. A site maintenance session that blocks hard on every provenance warning would be too brittle for solo operation. The severity ladder reflects the actual cost of each failure mode, not a theoretical maximum.&lt;/p&gt;
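&lt;p&gt;A minimal sketch of that severity ladder, assuming a simple pass/fail result per check. The check IDs and severities mirror the article; the gate logic itself is hypothetical, not the MICA source.&lt;/p&gt;

```python
# Illustrative: fold per-check results into one gate decision under a
# warn_continue policy. Severities come from the archive's check list.
CHECKS = {
    "ST-001": "error",    # provenance_sha256_format
    "ST-002": "warning",  # provenance_file_exists
    "ST-003": "error",    # invocation_pattern_present
}

SEVERITY_RANK = {"warning": 1, "error": 2}

def gate_decision(results, on_failure="warn_continue"):
    # results maps check_id to True (passed) or False (failed)
    failures = [cid for cid, ok in results.items() if not ok]
    if not failures:
        return "PASS"
    if on_failure == "warn_continue":
        # Surface the failure in the opening report, let the session run.
        return f"PASS with warnings ({', '.join(sorted(failures))})"
    # A stricter archive could block on any error-severity failure instead.
    worst = max(SEVERITY_RANK[CHECKS[cid]] for cid in failures)
    return "BLOCK" if worst == SEVERITY_RANK["error"] else "PASS with warnings"

print(gate_decision({"ST-001": True, "ST-002": False, "ST-003": True}))
```

&lt;p&gt;Under &lt;code&gt;warn_continue&lt;/code&gt;, a failed ST-002 produces a warning in the report rather than a hard stop, which matches the solo-operation calibration described above.&lt;/p&gt;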




&lt;h2&gt;
  
  
  4. What Drift Detection Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye82pmwyaq0ek7d7qp2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye82pmwyaq0ek7d7qp2k.png" alt="drift response policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive's &lt;code&gt;drift_response_policy&lt;/code&gt; is minimal by design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"drift_response_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_hash_mismatch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warn_continue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_file_missing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warn_block"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reminder_after_change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inline_sync_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction between &lt;code&gt;warn_continue&lt;/code&gt; and &lt;code&gt;warn_block&lt;/code&gt; is operationally significant.&lt;/p&gt;

&lt;p&gt;A hash mismatch means a tracked file changed — which happens legitimately during ordinary maintenance. The model surfaces the mismatch and continues.&lt;/p&gt;

&lt;p&gt;A file that has gone missing entirely is a different failure class. The model blocks and requires operator acknowledgment before proceeding.&lt;/p&gt;
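&lt;p&gt;The two response classes can be sketched as a small dispatch table. The action strings mirror the archive's &lt;code&gt;drift_response_policy&lt;/code&gt;; the dispatch function is illustrative, not actual MICA code.&lt;/p&gt;

```python
# Hypothetical sketch: map each drift failure class to its declared action.
POLICY = {
    "hash_mismatch": "warn_continue",  # tracked file changed legitimately
    "file_missing": "warn_block",      # ghost reference: needs the operator
}

def respond_to_drift(event, path):
    action = POLICY[event]
    if action == "warn_continue":
        return f"WARN: {event} on {path} -- continuing"
    # warn_block: stop until the operator acknowledges the failure.
    return f"BLOCK: {event} on {path} -- operator acknowledgment required"

print(respond_to_drift("hash_mismatch", "next.config.ts"))
print(respond_to_drift("file_missing", "playbook.md"))
```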

&lt;p&gt;&lt;code&gt;reminder_after_change: true&lt;/code&gt; means the archive instructs the model to remind the operator to refresh the provenance registry and artifact manifest before minting the next archive version. This is not automated enforcement. It is memory injection: the archive tells the next session what the previous session should have reminded the operator to do.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deviation_log&lt;/code&gt; in v0.2.0 is empty. That is not a sign the system has never been used. It means no deviations have been formally logged yet — which is itself a state the archive captures.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What Happens When a Deployment Changes Something
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focdk66if1kifuqc9znha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focdk66if1kifuqc9znha.png" alt="deployment evolution" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concrete scenario: the operator ships a writing refresh mode change — switching from automatic ISR to manual operator-triggered revalidation. Three files change: &lt;code&gt;next.config.ts&lt;/code&gt;, a helper &lt;code&gt;.bat&lt;/code&gt; script, and the playbook.&lt;/p&gt;

&lt;p&gt;On the next session open:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Self-test runs. ST-002 may flag if the helper script path is not in the provenance registry yet.&lt;/li&gt;
&lt;li&gt;Drift check runs. Hash mismatches fire for the changed files. &lt;code&gt;on_hash_mismatch: warn_continue&lt;/code&gt; — the session proceeds.&lt;/li&gt;
&lt;li&gt;The opening report surfaces the mismatch:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: flamehaven-space-maintainer v0.2.0
Self-test: PASS with warnings (ST-002: update-writing-now.bat not in provenance registry)
Drift: hash mismatch on next.config.ts, flamehaven-space-maintainer-playbook.v0.2.0.md
Active invariants: DI-001 (critical), DI-002 (critical) + 24 others
Gate: PASS WITH WARNINGS -- operator review required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator now has a concrete decision surface before touching anything: what changed, what the model knows about, and what it does not.&lt;/p&gt;

&lt;p&gt;The model then follows the change process defined in the playbook: identify the canonical subsystem touched, patch the smallest coherent surface, run build and audit, verify route-level behavior, then update README, MICA, or playbook if the change alters maintainer truth.&lt;/p&gt;

&lt;p&gt;At the end of the session, if a new archive version is minted, the synchronization rule is explicit: file name, &lt;code&gt;project.version&lt;/code&gt;, and the archive handoff marker must be updated in the same change. Not sequentially. The same change.&lt;/p&gt;
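&lt;p&gt;The synchronization rule is mechanically checkable. A sketch, assuming simple &lt;code&gt;vX.Y.Z&lt;/code&gt; tokens in each surface; the function and argument names are illustrative, not part of the MICA tooling.&lt;/p&gt;

```python
# Hypothetical check: the archive file name, project.version, and handoff
# marker must all name the same version before a new archive is minted.
import re

def version_in_sync(archive_filename, project_version, handoff_marker):
    def extract(text):
        m = re.search(r"v\d+\.\d+\.\d+", text)
        return m.group(0) if m else None
    versions = {
        extract(archive_filename),
        extract(project_version),
        extract(handoff_marker),
    }
    # Exactly one version token, found on every surface.
    return len(versions) == 1 and None not in versions

print(version_in_sync(
    "flamehaven-space-maintainer-playbook.v0.2.0.md",
    "v0.2.0",
    "handoff: archive v0.2.0",
))  # True: all three surfaces agree
```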

&lt;p&gt;This is the operational point of drift reporting: not merely to announce that something changed, but to force the model and the operator to see the same changed surface before any new work proceeds.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. What the Design Invariants Actually Govern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbq1ujtnozummxbd8pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbq1ujtnozummxbd8pl.png" alt="design invariants" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive carries 26 design invariants. The first six establish the perimeter everything else operates within:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DI-001&lt;/strong&gt; (critical): Flamehaven is positioned as a governance-first, founder-led B2B AI systems practice, not a generic agency or AI wrapper shop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-002&lt;/strong&gt; (critical): Primary conversion surface is the main domain &lt;code&gt;flamehaven.space&lt;/code&gt;, not Medium, Substack, DEV.to, or LinkedIn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-003&lt;/strong&gt; (high): Writing detail pages are authoritative artifacts linked to projects, selected work, contact, and SEO canonical ownership.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-004&lt;/strong&gt; (high): Selected Work must distinguish public, private, and internal systems without broken public links or placeholder copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-005&lt;/strong&gt; (high): Legacy WordPress-era routes must redirect away from obsolete agency messaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-006&lt;/strong&gt; (high): Operational choices favor deterministic behavior, inspectability, and maintenance continuity over decorative complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not style guidelines. They are session-blocking constraints. An AI that proposes converting Selected Work to a live-fetch real-time surface is violating DI-006. An AI that treats cross-posting as the canonical publishing path is violating DI-002. An AI that leaves a legacy route live because a redirect “seems unnecessary” is violating DI-005.&lt;/p&gt;

&lt;p&gt;The invariants exist so the model cannot rationalize its way past the operator's architectural intent — even across sessions, even with a new model instance that has never seen the project before.&lt;/p&gt;
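&lt;p&gt;The invariant shape the series describes — identity, severity, statement — is small enough to sketch. The two records below paraphrase DI-002 and DI-006; the data structure and blocking check are hypothetical illustrations, not the archive's schema.&lt;/p&gt;

```python
# Illustrative: a governance rule as structured data, not a style note.
INVARIANTS = {
    "DI-002": {
        "severity": "critical",
        "statement": "Primary conversion surface is flamehaven.space",
    },
    "DI-006": {
        "severity": "high",
        "statement": "Deterministic behavior over decorative complexity",
    },
}

def check_proposal(violated_ids):
    # Any violated invariant blocks the session; severity sets urgency.
    for di in violated_ids:
        inv = INVARIANTS[di]
        return f"BLOCKED by {di} ({inv['severity']}): {inv['statement']}"
    return "OK"

print(check_proposal(["DI-006"]))
print(check_proposal([]))  # OK
```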




&lt;h2&gt;
  
  
  7. What MICA Cannot Do Here
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tpmb7ohzye0el0dj8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tpmb7ohzye0el0dj8i.png" alt="system limits and human authority" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deviation_log&lt;/code&gt; is empty because no formal deviation has been logged. But there have been judgment calls.&lt;/p&gt;

&lt;p&gt;One example is the writing hero image fallback logic. The site had to support both modern Notion block structures and legacy imported posts with a different nesting format. That decision did not begin as an invariant. It began as a session-level judgment call, became a playbook rule, and only then became stable maintainer truth.&lt;/p&gt;

&lt;p&gt;That path matters.&lt;/p&gt;

&lt;p&gt;MICA does not automate the step from “we discussed this and made a call” to “this is now a governed constraint.” It provides the place to put the result. The operator decides what rises to the level of an invariant, what remains a lesson in the playbook, and what disappears when the session ends.&lt;/p&gt;

&lt;p&gt;That boundary — what gets governed, what gets remembered, what gets lost — is not a gap in MICA. It is a design decision made with every archive update.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What Part 7 Will Address
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhw3bmshi00jb3efpmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhw3bmshi00jb3efpmd.png" alt="preview part7" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 6 showed what MICA looks like in operation inside a single maintenance agent. The structure holds. The protocol runs. The session report is predictable.&lt;/p&gt;

&lt;p&gt;But that is still the easier case.&lt;/p&gt;

&lt;p&gt;The project being governed was a site: a relatively stable artifact, maintained by one operator, where the main problem was making sure the model did not forget what already mattered.&lt;/p&gt;

&lt;p&gt;Part 7 moves into a harder setting.&lt;/p&gt;

&lt;p&gt;The governed project is now a tool that runs inside AI workflows itself. That changes the governance problem. The issue is no longer only session amnesia. It is iterative accumulation: what the system learns across cycles, what becomes authoritative, what remains provisional, and what must be carried forward without allowing drift to harden into false memory.&lt;/p&gt;

&lt;p&gt;Part 7 is that case.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; A session report is not a polite summary. It is a hard gate. The model must declare — in the exact format dictated by the archive — what it loaded, what tests passed, and what drift it detected, before it is allowed to touch the repository.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.2.0 standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>contextengineering</category>
      <category>governance</category>
    </item>
    <item>
      <title>How Auditing 10 Bio-AI Repositories Shaped STEM-AI</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 30 Mar 2026 12:07:20 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5</link>
      <guid>https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbn484a53idqb1b8rm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbn484a53idqb1b8rm8.png" alt="Cover image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reading path:&lt;/strong&gt;&lt;br&gt;
This post is part of a series.&lt;br&gt;
(1) &lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;STEM-AI introduction&lt;/a&gt; — what the framework is and why we built it&lt;br&gt;
(2) &lt;a href="https://flamehaven.space/writing/bio-ai-repository-audit-2026-a-technical-report-on-10-open-source-systems/" rel="noopener noreferrer"&gt;Technical audit report&lt;/a&gt; — full findings across 10 repositories&lt;br&gt;
(3) &lt;a href="https://flamehaven.space/writing/i-audited-10-open-source-bio-ai-repos-most-could-produce-outputs-few-could-establish-trust/" rel="noopener noreferrer"&gt;Narrative summary&lt;/a&gt; — what those findings actually mean&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Text Could See — and What Code Revealed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favblcoe473uege82hbwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favblcoe473uege82hbwl.png" alt="what text could see" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In March 2026, we ran STEM-AI against 10 high-visibility open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The framework did what it was designed to do. It surfaced missing disclaimers, absent CI, weak reproducibility signals, and public-facing governance gaps. Those findings mattered, and the scores were directionally right.&lt;/p&gt;

&lt;p&gt;But when we reviewed the audits more carefully, one pattern kept appearing: some of the most consequential failures were not visible in the artifact surface at all. They only became obvious when we looked directly at the code.&lt;/p&gt;

&lt;p&gt;This function lives inside a repository presented as an AI-driven drug discovery workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_analogues(self, seed_smiles: str, count: int = 3):
    """
    Mocks a generative model (like REINVENT).
    In a real app, this would call a PyTorch model.
    """
    # Simple string manipulation for demo purposes
    analogues = []
    for i in range(count):
        if "C" in seed_smiles:
            new_smi = seed_smiles.replace("C", "C(C)", 1) if i == 0 else seed_smiles + "F"
            analogues.append(new_smi)
        else:
            analogues.append(seed_smiles + "C")
    return analogues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does not generate molecules. It appends characters.&lt;/p&gt;

&lt;p&gt;SMILES (Simplified Molecular Input Line Entry System) is a strict notation for molecular structure. A valid SMILES string encodes real geometry and bonding. Appending C produces a syntactically valid string that represents no real compound. The function runs without error, returns a list, and the pipeline continues downstream.&lt;/p&gt;
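&lt;p&gt;Simply calling the function (adapted below as a standalone version, dropping &lt;code&gt;self&lt;/code&gt;) makes the failure concrete: the "analogues" are string edits, and two of the three outputs are identical.&lt;/p&gt;

```python
# The mocked generator from the audited repository, adapted to run
# standalone. No chemistry happens here: only string surgery.
def generate_analogues(seed_smiles, count=3):
    analogues = []
    for i in range(count):
        if "C" in seed_smiles:
            new_smi = seed_smiles.replace("C", "C(C)", 1) if i == 0 else seed_smiles + "F"
            analogues.append(new_smi)
        else:
            analogues.append(seed_smiles + "C")
    return analogues

result = generate_analogues("CCO")  # seed: ethanol
print(result)  # ['C(C)CO', 'CCOF', 'CCOF'] -- duplicate "molecules"
```

&lt;p&gt;Every call with the same seed and count returns the same strings, including the duplicate — behavior no generative model exhibits.&lt;/p&gt;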

&lt;p&gt;Our framework scored this repository T0. Correctly. But not because it saw this function.&lt;/p&gt;

&lt;p&gt;It scored T0 because the README was missing disclaimers. The CI was absent. Reproducibility was undocumented. Text-path evaluation is designed to measure exactly that. It did.&lt;/p&gt;

&lt;p&gt;The audit result was correct. The evidence surface had room to go deeper.&lt;/p&gt;

&lt;p&gt;Running the audits showed us what code-path evaluation could add on top.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Code Access Makes Visible&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhefzdhoiz8nakmrf3959.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhefzdhoiz8nakmrf3959.png" alt="What Code Access Makes Visible" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The drug discovery example was not unusual.&lt;/p&gt;

&lt;p&gt;CellAgent's pipeline ends with this call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py, line 153
&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The method exists. Its body is &lt;code&gt;pass&lt;/code&gt;. The pipeline completes without error and produces nothing. A text audit reading the README would have no way to know this.&lt;/p&gt;

&lt;p&gt;BioAgents includes a rate limiter for external API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// rateLimiter.ts, lines 62-68&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;USE_JOB_QUEUE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rateLimiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Job queue disabled - skip rate limiting for direct calls&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;USE_JOB_QUEUE&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt; in &lt;code&gt;.env.example&lt;/code&gt;. Every default deployment has rate limiting disabled. The function name implies protection. In default operation, there is none.&lt;/p&gt;

&lt;p&gt;The pattern across all three is the same:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The code looks governed.&lt;/li&gt;
&lt;li&gt;The behavior tells a different story.&lt;/li&gt;
&lt;li&gt;That story is only visible when you read the code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Text scores and code behavior can diverge. Knowing where and how they diverge is the next layer of evidence worth capturing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Directions the Audits Opened
&lt;/h2&gt;

&lt;p&gt;Reviewing all ten audits, we identified four areas where code-path evaluation could extend what text auditing already does well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direction 1: Clinical exposure is visible in imports, not just in README text.
&lt;/h3&gt;

&lt;p&gt;A repository importing pharmacogenomics allele tables has clinical exposure regardless of what its README says. Detecting that dependency at the import level — rather than waiting for a disclaimer — lets the framework flag exposure earlier. &lt;/p&gt;

&lt;p&gt;The key distinction is severity: a direct pharmacogenomics import (&lt;code&gt;CYP2D6&lt;/code&gt;, &lt;code&gt;CPIC&lt;/code&gt;) signals live patient-facing risk and is classified CA-DIRECT. &lt;/p&gt;

&lt;p&gt;A general-purpose medical imaging library like &lt;code&gt;pydicom&lt;/code&gt; or MONAI is classified CA-INDIRECT — research-use exposure, not necessarily a live clinical output path. The import alone does not determine clinical risk; the classification tier does.&lt;/p&gt;
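&lt;p&gt;A minimal sketch of that two-tier classification, assuming substring matching against import lines. The token lists echo the article's examples; the scanner itself is illustrative, not the STEM-AI scan script.&lt;/p&gt;

```python
# Hypothetical classifier: direct pharmacogenomics references outrank
# general medical-imaging libraries.
CA_DIRECT = ["CPIC", "CYP2D6", "CYP2C19", "DPYD"]  # patient-facing risk
CA_INDIRECT = ["pydicom", "monai"]                 # research-use exposure

def classify_import(line):
    if any(tok in line for tok in CA_DIRECT):
        return "CA-DIRECT"
    if any(tok in line.lower() for tok in CA_INDIRECT):
        return "CA-INDIRECT"
    return None

print(classify_import("from cpic_tables import CYP2D6_ALLELES"))  # CA-DIRECT
print(classify_import("import pydicom"))                          # CA-INDIRECT
print(classify_import("import numpy as np"))                      # None
```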




&lt;h3&gt;
  
  
  Direction 2: Not all clinical proximity is the same.
&lt;/h3&gt;

&lt;p&gt;A live pharmacogenomics dosage tool and a README roadmap note about a future ClinVar integration are not equivalent risks. Differentiating them — live output vs. research context vs. planned feature — makes the evaluation more precise and makes the accountability expectations more appropriate.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 3: Scoring stability is worth measuring directly.
&lt;/h3&gt;

&lt;p&gt;We ran Stage 1 on one repository in multiple passes. The results ranged across 28 points on the same input. Overlapping trigger conditions between hype-detection items are one contributing factor. &lt;/p&gt;

&lt;p&gt;LLM runtime stochasticity is another — the exact split between the two is still under measurement. Adding explicit discrimination examples — what exact phrasing triggers each item, what does not — makes the scoring surface cleaner and reduces the most obvious sources of variance.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 4: Code-path behavior deserves its own scan layer.
&lt;/h3&gt;

&lt;p&gt;A fail-open pattern is a control path that appears to enforce a constraint but defaults to bypassing it. The BioAgents rate limiter above is the example. In a clinical output path, a silent pass-through is not graceful degradation. It is an untraced result that looks like a real one. Building a dedicated scan for these patterns adds a check that text auditing was never meant to provide.&lt;/p&gt;
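&lt;p&gt;The fail-open shape, contrasted with a fail-closed default, can be sketched in a few lines. The names (&lt;code&gt;use_job_queue&lt;/code&gt;, &lt;code&gt;consume&lt;/code&gt;) are modeled on the BioAgents example but are illustrative, not its actual API.&lt;/p&gt;

```python
# Fail-open: when the control is not configured, the request passes
# silently -- the pattern the scan layer is built to catch.
def fail_open_limiter(use_job_queue, consume):
    if use_job_queue:
        return consume()
    return "allowed"  # default path silently skips the control

# Fail-closed: when the control is not configured, refuse the request
# rather than silently bypass the limit.
def fail_closed_limiter(use_job_queue, consume):
    if not use_job_queue:
        raise RuntimeError("rate limiting unavailable -- refusing request")
    return consume()

# With the shipped default (use_job_queue=False), the fail-open version
# lets every request through unmetered.
print(fail_open_limiter(False, lambda: "metered"))  # allowed
```

&lt;p&gt;A scan for this pattern looks for control-flow branches where the disabled configuration falls through to the permissive path, exactly what text auditing cannot see.&lt;/p&gt;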




&lt;p&gt;These four directions came directly from running the audits. The scores across the 10 repositories remain as published. Code-path evaluation is what the framework can now add on top of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What v1.0.6 Added — Carried Forward in v1.1.2&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These changes were introduced in v1.0.6 and are carried forward in the current internal v1.1.2 package. They extend the framework's evidence surface into code-level behavior. Calibration is ongoing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Two Evidence Paths, Not One
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkol0yxx1ht5r0svy5yk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkol0yxx1ht5r0svy5yk.png" alt="Two Evidence Paths, Not One" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We narrowed one of the biggest divergence points by splitting evaluation into a text path and a code path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;text path&lt;/strong&gt; works as before: read the README, CHANGELOG, and public posts, score against the rubric. Always available regardless of access to the repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;code path&lt;/strong&gt; activates when the audit has a local clone. It runs through Claude Code, Codex CLI, Gemini CLI, or Copilot CLI. Claims are not interpreted. They are measured. A README that says "IRB-approved data" earns no points for the statement. Points require a provenance artifact in the code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When code confirms the README, that is a positive signal. When it contradicts it, that contradiction is the finding.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Clinical Dependency Detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbakbfbq74n5vp5d2rig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbakbfbq74n5vp5d2rig.png" alt="Clinical Dependency Detection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the start of every local audit, a scan script reads Python imports and README keywords. It classifies the result into one of three severity levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ca_detection_scan.sh -- pharmacogenomics section&lt;/span&gt;

&lt;span class="c"&gt;# CA-DIRECT: live patient-facing output risk&lt;/span&gt;
check_import &lt;span class="s2"&gt;"CPIC|cpic|PharmGx|pharmacogenomic"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Pharmacogenomics (CPIC/PharmGx)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-DIRECT"&lt;/span&gt;

check_import &lt;span class="s2"&gt;"DPYD|CYP2D6|CYP2C19|allele"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Pharmacogene alleles"&lt;/span&gt; &lt;span class="s2"&gt;"CA-DIRECT"&lt;/span&gt;

&lt;span class="c"&gt;# CA-INDIRECT: research-use clinical exposure&lt;/span&gt;
check_import &lt;span class="s2"&gt;"import pydicom|from pydicom"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"pydicom (DICOM imaging)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;

check_import &lt;span class="s2"&gt;"import monai|from monai"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"MONAI (medical AI)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accountability requirements follow the actual clinical proximity of the code. Not the aspirational proximity of the roadmap. A roadmap mention without active implementation is treated as CA-PLANNED rather than collapsed into the same bucket as live clinical output. &lt;/p&gt;

&lt;p&gt;The pattern matching is against import statements and function names, not comment text. False positive calibration is still in progress.&lt;/p&gt;
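
&lt;p&gt;The scan excerpt above calls a &lt;code&gt;check_import&lt;/code&gt; helper without showing it. A minimal sketch of what such a helper could look like, assuming a GNU grep environment (the real script may differ):&lt;/p&gt;

```shell
# Hypothetical definition of the check_import helper used above.
# Two-stage match: first restrict to import statements, then apply the
# severity pattern, so comments and prose cannot trigger a finding.
check_import() {
  pattern="$1"
  label="$2"
  severity="$3"
  if grep -rhE "^(import|from) " --include='*.py' "${REPO_DIR:-.}" \
       | grep -Eq "$pattern"; then
    echo "[$severity] $label detected"
  fi
}
```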




&lt;h3&gt;
  
  
  3. Code Integrity Scanning (C1-C4)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69g7synh94iizuciqpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69g7synh94iizuciqpy.png" alt="Code Integrity Scanning (C1-C4)" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A second scan handles four code-level checks: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hardcoded credentials (C1)&lt;/li&gt;
&lt;li&gt;unpinned dependencies (C2)&lt;/li&gt;
&lt;li&gt;clinical-path stubs (C3)&lt;/li&gt;
&lt;li&gt;fail-open exception handlers (C4)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The C4 check targets the BioAgents-style pattern. Searching for clinical keywords on the &lt;code&gt;except:&lt;/code&gt; line misses most real cases. Clinical context lives in function names and surrounding code. The scan uses a two-pass approach: first identify files with clinical-domain context, then find silent exception handlers within those files.&lt;/p&gt;

&lt;p&gt;A silent &lt;code&gt;except: pass&lt;/code&gt; in a clinical-context file is a trust-surface failure. The scan makes it visible without requiring a reviewer to read every exception block manually.&lt;/p&gt;
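
&lt;p&gt;A hedged sketch of that two-pass idea (the keyword list and patterns below are illustrative examples, not the framework's actual rule set):&lt;/p&gt;

```shell
# Illustrative two-pass C4 scan. Keywords and patterns are examples.
CLINICAL_CONTEXT='dosing|allele|diagnos|patient|pharmaco'

c4_scan() {
  repo="$1"
  # Pass 1: find files with clinical-domain context anywhere in the code
  grep -rlEi "$CLINICAL_CONTEXT" --include='*.py' "$repo" |
  while read -r f; do
    # Pass 2: flag a bare "pass" directly under an except clause
    if awk '/except.*:/ {e=1; next}
            e==1 {if ($0 ~ /^[[:space:]]*pass[[:space:]]*$/) found=1; e=0}
            END {exit 1-found}' "$f"; then
      echo "C4 $f: silent exception handler in clinical-context file"
    fi
  done
}
```

&lt;p&gt;A single-pass grep on the &lt;code&gt;except:&lt;/code&gt; line would miss every case where the clinical term lives in the function name three lines up. The two-pass split is what makes those cases visible.&lt;/p&gt;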




&lt;h3&gt;
  
  
  4. Discrimination Examples
&lt;/h3&gt;

&lt;p&gt;To reduce the 28-point variance, we added explicit examples for each hype-detection item: what exact phrasing triggers it, what does not, and what the documented edge cases are.&lt;/p&gt;

&lt;p&gt;The goal is to reduce obvious scoring drift enough that the same repository is no longer interpreted as two different trust surfaces across different auditors or different LLMs. That goal is not yet verified. The discrimination examples are the primary mechanism toward it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Three Questions Now Have a Fourth&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the first post, we described three questions the framework was built around:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Did the repository describe its limits honestly?&lt;/li&gt;
&lt;li&gt;Did public communication remain consistent with those limits?&lt;/li&gt;
&lt;li&gt;Did the codebase show evidence of maintenance and biological responsibility?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Running 10 real audits pointed toward a fourth:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;4. Does the code actually do what the documentation says — and where it diverges, is that divergence visible and traceable?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That fourth question is what the audit outputs kept surfacing. A function name that sounds real. A pipeline that looks complete. An output that is plausible. An implementation that is a stub, or a control path that silently bypasses its own constraint.&lt;/p&gt;

&lt;p&gt;The first three questions can be answered by reading. The fourth requires looking at the code.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What the Framework Added — and What Stays the Same&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first post ended with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"STEM-AI is meant to support serious review, not replace it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That has not changed. Every report carries a non-removable disclaimer: LLM-generated audit, not a regulatory determination, not clinical certification. Every report carries an expiry date. The minimum threshold for supervised pilot consideration is still T3. None of the March 2026 repositories reached it.&lt;/p&gt;

&lt;p&gt;What the audits added is narrower: broader evidence coverage on top of an already-working foundation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yoxv47netm2tmf881xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yoxv47netm2tmf881xg.png" alt="What the Framework Added — and What Stays the Same" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A verifiable artifact shifts the accountability surface — it does not eliminate the possibility of falsification. The framework treats its presence as a necessary condition, not a sufficient one.&lt;/p&gt;

&lt;p&gt;What the framework gained is that the evidence it counts now extends beyond what authors say about their own code.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Three Directions Still Ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvbbl6bgo1z79mrg7wgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvbbl6bgo1z79mrg7wgr.png" alt="Three Directions Still Ahead" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated re-audit on repository changes.&lt;/strong&gt; A score from three months ago may not describe the same repository. The trajectory signal measures issue close rate and release frequency across consecutive 90-day windows. It is a partial answer. A CI-triggered re-audit path is the logical next step.&lt;/p&gt;
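
&lt;p&gt;The per-window arithmetic behind that trajectory signal can be sketched as follows; the counts are made up, and the real windowing would be driven by issue and release timestamps from repository metadata:&lt;/p&gt;

```shell
# Sketch of the per-window close-rate arithmetic only.
close_rate() {
  closed="$1"
  opened="$2"
  if [ "$opened" -eq 0 ]; then
    echo "n/a"
    return
  fi
  echo $((closed * 100 / opened))
}

# Two consecutive 90-day windows: a rising close rate reads as a
# positive trajectory signal, a falling one as decay.
echo "window1=$(close_rate 12 20)% window2=$(close_rate 18 20)%"
```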

&lt;p&gt;&lt;strong&gt;The denominator problem.&lt;/strong&gt; Zero of 10 repositories reached T3. This may accurately describe the ecosystem's current state. It may also reflect calibration issues in the upper tiers. Distinguishing between the two requires before-and-after auditing of repositories that have received systematic governance remediation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Stage 2 redistribution question.&lt;/strong&gt; Most audits have no cross-platform consistency data. When that data is unavailable, the framework redistributes Stage 2's weight equally between documentation quality and engineering accountability. &lt;/p&gt;

&lt;p&gt;For repositories with clinical-direct exposure, a well-written README can then compensate for weak code accountability. A guardrail flags this condition. The current redistribution rule is explicit but not yet final — it remains one of the framework's open calibration questions.&lt;/p&gt;
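
&lt;p&gt;One way to picture the redistribution rule (the 20-point Stage 2 weight is an example figure, not the framework's actual weighting):&lt;/p&gt;

```shell
# Example redistribution when cross-platform consistency data is absent.
# The 20-point Stage 2 weight is illustrative.
redistribute_stage2() {
  stage2_weight="$1"     # points normally tied to cross-platform consistency
  has_xplat_data="$2"    # "yes" or "no"
  if [ "$has_xplat_data" = "yes" ]; then
    echo "consistency=$stage2_weight docs=+0 engineering=+0"
  else
    half=$((stage2_weight / 2))
    # Guardrail: for clinical-direct repos, flag that documentation can
    # now compensate for weak code accountability.
    echo "consistency=0 docs=+$half engineering=+$half guardrail=CA-DIRECT-check"
  fi
}

redistribute_stage2 20 no
```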




&lt;blockquote&gt;
&lt;p&gt;If there are open-source bio-AI repositories you think should be audited next, drop them in the comments. Bonus if they claim clinical relevance, drug discovery, or medical reasoning.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;STEM-AI v1.1.2 — Trust Evaluation Framework for Medical AI Repositories.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;"Code works. But does the author care about the patient?"&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>bioinformatics</category>
      <category>healthtech</category>
    </item>
    <item>
      <title>Everyone Was Talking About Context Engineering. Nobody Had Solved Governance.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 25 Mar 2026 09:29:42 +0000</pubDate>
      <link>https://dev.to/flamehaven01/everyone-was-talking-about-context-engineering-nobody-had-solved-governance-424j</link>
      <guid>https://dev.to/flamehaven01/everyone-was-talking-about-context-engineering-nobody-had-solved-governance-424j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Disclosure: This article was written by the author with AI assistance for editing. All technical content, architecture decisions, and design rationale are the author's own. #ABotWroteThis&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Fail-Closed Gate&lt;/strong&gt;: An admission rule that excludes a context item if it fails a required threshold — regardless of its score on other dimensions. No exceptions. Introduced in v0.1.7.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;README-as-Protocol&lt;/strong&gt;: The pattern in which an AI session's natural behavior of reading the README first is formalized as the primary invocation mechanism. No installation required. Introduced in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation Protocol&lt;/strong&gt;: The schema-level declaration of how a MICA archive reaches an AI session — and how the session confirms it was loaded. Formalized as a required field in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening report the model must produce at session start to confirm the archive was loaded. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant Entry&lt;/strong&gt;: A structured governance rule with identity, rule text, and severity. Replaced plain string invariants in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state — file existence, hash integrity, and README sync. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Playbook&lt;/strong&gt;: The operator-facing discipline layer that sits outside the schema. The schema enforces structure; the Playbook enforces judgment.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Context Engineering&lt;/strong&gt;: The practice of shaping what the model sees, in what order, with what boundaries, and under what assumptions — not just what you ask it, but what it actually knows at runtime.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;CTX&lt;/strong&gt;: A collection-first context packaging approach that gathers relevant workspace material and delivers it to the model. In this article, CTX represents the collection layer of context engineering — answering, “What does the AI see?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqx6707g8j9hu59d7wiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqx6707g8j9hu59d7wiu.png" alt="Trustworthy context engineering" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What Parts 1 Through 4 Actually Established
&lt;/h2&gt;

&lt;p&gt;The first four parts of this series already narrowed the problem considerably.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/my-llm-kept-forgetting-my-project-so-i-built-a-governance-schema-4bo6"&gt;Part 1&lt;/a&gt; defined the failure mode.&lt;/strong&gt;&lt;br&gt;
The issue was not that long-running AI work needed “more prompt.” The issue was that a model can only act on what it actually knows right now, and most project context systems still treat that as a document problem instead of a governance problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626"&gt;Part 2&lt;/a&gt; established the first hard boundary.&lt;/strong&gt;&lt;br&gt;
A schema can exist, and the model can still have no reliable way to know it exists. That is the difference between a document and a constraint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-stake-was-governance-outside-the-schema-mica-v015-pulled-it-in-46n9"&gt;Part 3&lt;/a&gt; moved governance into the schema.&lt;/strong&gt;&lt;br&gt;
Provenance, deviations, and semantic rules could no longer remain outside the system in READMEs, comments, or team habits. They had to become machine-readable structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9"&gt;Part 4&lt;/a&gt; made that structure operative.&lt;/strong&gt;&lt;br&gt;
The model already treated the README as its natural entry surface. Once that behavior was declared as an invocation protocol, the schema stopped being a passive archive and became a runtime contract.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That progression defines what MICA is actually trying to solve.&lt;/p&gt;

&lt;p&gt;It is not trying to be “better prompting.”&lt;br&gt;&lt;br&gt;
It is not trying to be “more retrieval.”&lt;br&gt;&lt;br&gt;
It is trying to answer a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does governed context reach the model, under declared rules, with confirmable load and auditable change?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the bridge into the broader landscape.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Context Engineering Was Never Just Prompting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1mx3x0xenx6pc80hd16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1mx3x0xenx6pc80hd16.png" alt="The structural gap in context engineering" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One clarification matters before going further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; is not just prompt writing.&lt;/p&gt;

&lt;p&gt;At the broadest level, it is the practice of shaping what the model sees, in what order, with what boundaries, and under what assumptions. Prompts are one part of that. Retrieval is another. File selection, memory handoff, system instructions, and workspace state all belong to the same larger question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the model actually know right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By that definition, MICA is part of context engineering.&lt;/p&gt;

&lt;p&gt;Not because it retrieves context.&lt;br&gt;&lt;br&gt;
Not because it packs more tokens into a window.&lt;br&gt;&lt;br&gt;
But because it governs which context is allowed to shape the session, under what trust conditions, and with what record when those conditions are tested.&lt;/p&gt;

&lt;p&gt;That distinction matters, because most of the field has focused on a different layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. The Conversation Was Already Happening
&lt;/h2&gt;

&lt;p&gt;Context engineering is not a new idea, and MICA does not claim to have invented the conversation.&lt;/p&gt;

&lt;p&gt;The term was amplified by Andrej Karpathy and others, but the underlying practice — designing what the model sees, not just what you ask it — had already been emerging in serious AI work.&lt;/p&gt;

&lt;p&gt;Collection-first tools already existed. CTX is a useful example of that layer: it gathers relevant workspace material and delivers it to the model without manual copy-paste. It answers an important question well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the same time, some of the sharper practitioner writing was already moving beyond collection alone. One such example was an OpenAI Developer Community post by Serge Liatko, &lt;a href="https://community.openai.com/t/prompt-engineering-is-dead-and-context-engineering-is-already-obsolete-why-the-future-is-automated-workflow-architecture-with-llms/1314011" rel="noopener noreferrer"&gt;&lt;strong&gt;“Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete: Why the Future Is Automated Workflow Architecture with LLMs.”&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The value of that piece was not the author's status, but the precision of the problem it named: manually maintained context eventually reaches its ceiling. &lt;/p&gt;

&lt;p&gt;Once system state changes faster than humans can keep context aligned, the real question is no longer just how to collect context, but how to automate its ownership, maintenance, and validation as the system evolves.&lt;/p&gt;

&lt;p&gt;That was an important move forward.&lt;/p&gt;

&lt;p&gt;But one layer still remained underdefined.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. The Missing Layer Was Already Visible
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdr13fr8q5dff4jqhini.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdr13fr8q5dff4jqhini.png" alt="Gathering data vs governing data" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
The same missing layer had already shown up elsewhere.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/flamehaven01/your-agentic-stack-has-two-layers-it-needs-three-3h1"&gt;&lt;strong&gt;Your Agentic Stack Has Two Layers. It Needs Three&lt;/strong&gt;&lt;/a&gt;, I argued that the usual stack had matured around two strong layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP / tool calls — &lt;strong&gt;how&lt;/strong&gt; the agent talks to systems&lt;/li&gt;
&lt;li&gt;agent skills — &lt;strong&gt;what&lt;/strong&gt; the agent can do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But something was still missing above both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the layer that decides &lt;strong&gt;whether&lt;/strong&gt; the agent should do it, under what constraints, and toward what end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the layer of intent, authority, and governance.&lt;/p&gt;

&lt;p&gt;The same problem appears in context engineering.&lt;/p&gt;

&lt;p&gt;A context pipeline can be excellent at retrieval and still be weak at governance. It can gather the right files, summarize the right notes, and deliver the right-looking material — and still fail to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is authoritative?&lt;/li&gt;
&lt;li&gt;what is provisional?&lt;/li&gt;
&lt;li&gt;what must never be violated?&lt;/li&gt;
&lt;li&gt;what changed since last time?&lt;/li&gt;
&lt;li&gt;how does the session prove it loaded the governed archive at all?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not collection questions.&lt;/p&gt;

&lt;p&gt;They are governance questions.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Where CTX Stops and MICA Begins
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye2vmym08cd61jq7vn62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye2vmym08cd61jq7vn62.png" alt="Inside context engineering, beyond collection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CTX solves the collection problem.&lt;/p&gt;

&lt;p&gt;It answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a necessary layer. Without it, context management collapses back into manual copy-paste, repeated explanation, and fragile session startup.&lt;/p&gt;

&lt;p&gt;But it does not answer the next set of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where did this context item come from, and can that claim be verified?&lt;/li&gt;
&lt;li&gt;What happens when a file changes between sessions?&lt;/li&gt;
&lt;li&gt;Which constraints are non-negotiable, and what is the consequence when they are violated?&lt;/li&gt;
&lt;li&gt;Who approved the last change to the archive, and can that decision be audited?&lt;/li&gt;
&lt;li&gt;When the session begins, how does the system confirm the archive was actually loaded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not retrieval questions.&lt;/p&gt;

&lt;p&gt;They are governance questions.&lt;/p&gt;

&lt;p&gt;That is the narrow but consequential difference.&lt;/p&gt;

&lt;p&gt;CTX collects context and delivers it.&lt;br&gt;&lt;br&gt;
MICA governs what trust that context carries, what invariants it must not violate, what happens when it changes, and how the session proves that the governed archive was actually loaded.&lt;/p&gt;

&lt;p&gt;One answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The other answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Under what rules does the AI operate — and what is the record when those rules are tested?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They are different layers. Neither replaces the other.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. The Part That Was Still Open
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufv6hwpt82ohirmltadz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufv6hwpt82ohirmltadz.png" alt="The four gates of the governance layer" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A recurring theme in serious context-engineering discussion is that context cannot remain a hand-curated artifact forever. It has to become a function of system state.&lt;/p&gt;

&lt;p&gt;But that still leaves one question open:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Who owns the specification for each step's input — and how is this versioned, tested, and audited as requirements shift?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MICA is a concrete, working answer to that gap.&lt;/p&gt;

&lt;p&gt;Not the only possible answer. Not necessarily the final one. But a real one.&lt;/p&gt;

&lt;p&gt;Its claim is not that context engineering needed to be invented.&lt;/p&gt;

&lt;p&gt;Its claim is that context engineering still needed a governance layer with at least four properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine-addressable invariants&lt;/li&gt;
&lt;li&gt;versioned and auditable change records&lt;/li&gt;
&lt;li&gt;self-tests against the real project state&lt;/li&gt;
&lt;li&gt;declared invocation with confirmable session load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the layer MICA was built to supply.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. What Governance Actually Means Here
&lt;/h2&gt;

&lt;p&gt;Governance is an overloaded word. In this context it means something specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every context item must declare where it came from in a way that can be checked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auditability&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Changes to the archive are recorded when they happen, not reconstructed later from memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invariant enforcement&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Constraints are not vague README prose. They are structured entries with identity and severity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-testing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The archive is checked against the real project state, not only against its own internal shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation confirmation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The model does not silently ignore the archive. Session start requires a structured acknowledgment that the governed archive was loaded.&lt;/p&gt;

&lt;p&gt;None of these are abstract principles in MICA.&lt;br&gt;&lt;br&gt;
They are structural requirements in a running schema.&lt;/p&gt;
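
&lt;p&gt;As one hedged illustration of the self-testing requirement, a check of this shape could validate file existence and hash integrity against the real project state (the function name and use of &lt;code&gt;sha256sum&lt;/code&gt; are assumptions for illustration, not the v0.1.8 schema's actual definitions):&lt;/p&gt;

```shell
# Hypothetical self-test for one archive entry: the referenced file must
# exist, and its hash must match what the archive recorded.
self_test_item() {
  path="$1"
  expected_hash="$2"
  if [ ! -f "$path" ]; then
    echo "FAIL: $path missing"
    return 1
  fi
  actual=$(sha256sum "$path" | cut -d' ' -f1)
  if [ "$actual" != "$expected_hash" ]; then
    echo "FAIL: $path hash drift"
    return 1
  fi
  echo "PASS: $path"
}
```

&lt;p&gt;A hash mismatch here is exactly the drift case the schema is built to surface: the archive still describes a file that the project no longer contains.&lt;/p&gt;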


&lt;h2&gt;
  
  
  8. What This Is Not
&lt;/h2&gt;

&lt;p&gt;It is worth being precise about the boundary.&lt;/p&gt;

&lt;p&gt;MICA does &lt;strong&gt;not&lt;/strong&gt; generate context automatically from the codebase. That is not its job. Collection-first systems already exist, and they are valuable. MICA governs what happens once context has been identified.&lt;/p&gt;

&lt;p&gt;MICA does &lt;strong&gt;not&lt;/strong&gt; replace human judgment. A schema can require structure, audit trail, drift response, and self-tests. It cannot eliminate operator discipline. That is why the boundary between schema and playbook matters.&lt;/p&gt;

&lt;p&gt;MICA is also &lt;strong&gt;not&lt;/strong&gt; a finished system. Parts 1 through 4 of this series were explicit about what each version got wrong, what each version corrected, and what remained unresolved.&lt;/p&gt;

&lt;p&gt;That design history is part of the claim, not an embarrassment to it.&lt;/p&gt;


&lt;h2&gt;
  
  
  9. The Actual Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2byeodafgh0grihwdak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2byeodafgh0grihwdak.png" alt="Operative Governance" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The actual claim is not that MICA solves all of context engineering.&lt;/p&gt;

&lt;p&gt;It is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A small operation can already run a governed AI context system — with verifiable provenance, deviation audit trail, structured invariants, self-testing, and declared invocation — without waiting for future tooling that does not yet exist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That claim is demonstrated by the design history already covered in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt; made scoring implementable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.5&lt;/strong&gt; brought governance structure into the schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.7&lt;/strong&gt; made scoring fail-closed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.8&lt;/strong&gt; made invocation declared and confirmable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.8.1&lt;/strong&gt; clarified the remaining runtime ambiguities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in practice, governance at runtime can look as concrete as this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Gate: PASS (self-tests: 7/7) | Track: A,B
Critical Invariants: 3/3 | Deviations: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a critical check fails, the session does not proceed. That is what governance looks like when it becomes operative.&lt;/p&gt;
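
&lt;p&gt;The fail-closed rule behind that report can be sketched in a few lines; the check names and report wording here are illustrative, not MICA's actual output format:&lt;/p&gt;

```shell
# Illustrative fail-closed gate: any failed critical check blocks the
# session, regardless of how the other checks scored.
gate() {
  failures=0
  total=0
  for result in "$@"; do          # e.g. "files:PASS" "hashes:FAIL"
    total=$((total + 1))
    case "$result" in
      *:FAIL) failures=$((failures + 1)) ;;
    esac
  done
  if [ "$failures" -eq 0 ]; then
    echo "[SESSION READY] Gate: PASS (self-tests: $total/$total)"
  else
    echo "[SESSION BLOCKED] $failures critical check(s) failed"
    return 1
  fi
}
```

&lt;p&gt;The design choice is the &lt;code&gt;return 1&lt;/code&gt;: a blocked gate is an exit condition, not a warning the session can talk its way past.&lt;/p&gt;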

&lt;p&gt;What can be said now is narrower and more solid: the gap is real, collection-first systems solve one side of it, and MICA addresses the governance side. The conversation about what comes after context engineering was already happening; MICA is one concrete answer to that part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. What Comes Next
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfq5h3kuzj7haxz1kbni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfq5h3kuzj7haxz1kbni.png" alt="From landscape to concrete operation" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 4 ended with a specific question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where does MICA sit in the context engineering landscape that already existed around it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post is the answer.&lt;/p&gt;

&lt;p&gt;It sits inside context engineering — but not at the collection layer.&lt;/p&gt;

&lt;p&gt;It does not compete with retrieval-first systems by trying to collect more files, pack more tokens, or automate more handoff. It begins after that layer. Its job is to govern what enters the session, what remains authoritative, what drift means, what violations matter, and how the model proves it actually loaded the governed archive at all.&lt;/p&gt;

&lt;p&gt;That is why the answer is narrower than most people expect, and more specific than most framings allow.&lt;/p&gt;

&lt;p&gt;MICA is not “the future of all context engineering.”&lt;br&gt;
It is a governance answer to the part of context engineering that collection alone does not solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 6&lt;/strong&gt; will move back down from landscape to concrete operation.&lt;/p&gt;

&lt;p&gt;It will show what MICA looks like in an actual project context: what a session opening report looks like, what a deviation log entry looks like in practice, and what happens when a self-test flags drift.&lt;/p&gt;

&lt;p&gt;After that comes the harder question: what remains unresolved.&lt;/p&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jw1nvooyqfi7pz0ccdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jw1nvooyqfi7pz0ccdk.png" alt="The named decision" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; Governance is not a layer you add after context engineering works. It is the layer that makes context engineering trustworthy — by declaring what is authoritative, recording what changes, and confirming what the AI actually loaded.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>The Model Already Read the README. MICA v0.1.8 Made It a Protocol</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 23 Mar 2026 08:20:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9</link>
      <guid>https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Disclosure: This article was written by the author with AI assistance for editing. All technical content, architecture decisions, and design rationale are the author's own. #ABotWroteThis&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Fail-Closed Gate&lt;/strong&gt;: An admission rule that excludes a context item if it fails a required threshold — regardless of its score on other dimensions. No exceptions. Introduced in v0.1.7.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;README-as-Protocol&lt;/strong&gt;: The pattern in which an AI session's natural behavior of reading the README first is formalized as the primary invocation mechanism. No installation required. Introduced in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation Protocol&lt;/strong&gt;: The schema-level declaration of how a MICA archive reaches an AI session — and how the session confirms it was loaded. Formalized as a required field in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening report the model must produce at session start to confirm the archive was loaded. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant Entry&lt;/strong&gt;: A structured governance rule with identity, rule text, and severity. Replaced plain string invariants in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state — file existence, hash integrity, and README sync. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxio2q48c5ycqi13sv1ub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxio2q48c5ycqi13sv1ub.png" alt="Cover Image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Where Part 3 Left Off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/the-stake-was-governance-outside-the-schema-mica-v015-pulled-it-in-46n9"&gt;Part 3&lt;/a&gt; covered v0.1.0 through v0.1.5.&lt;/p&gt;

&lt;p&gt;By that point, the schema had become implementable, then auditable. Provenance gained a registry. Changes gained an audit trail. Invariants gained structure.&lt;/p&gt;

&lt;p&gt;But two things were still missing.&lt;/p&gt;

&lt;p&gt;Scoring was still not structurally enforced at the schema level. The formula existed, but it was still closer to convention than contract.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;invocation_protocol&lt;/code&gt; still did not exist. The archive could be perfectly governed — verified, logged, ruled — and the model could still begin work with no instruction to locate it.&lt;/p&gt;

&lt;p&gt;The governance was visible. It was not yet operative.&lt;/p&gt;

&lt;p&gt;This part covers the versions that closed both gaps — and the observation that made the harder one easier to solve than expected.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. v0.1.7: Scoring Becomes a Contract
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y2pwouk7dtmxrvcoq8q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y2pwouk7dtmxrvcoq8q.png" alt="Admission is now fail-closed" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since v0.0.1, scoring had been one of the structurally weakest parts of MICA.&lt;/p&gt;

&lt;p&gt;The earliest version used hardcoded test values with no defined combination rule. Later versions introduced an implementable formula, but the formula still lived as a declared behavior rather than a fully enforced contract. A conforming archive could still be structurally valid while leaving too much scoring behavior open to interpretation.&lt;/p&gt;

&lt;p&gt;v0.1.7 changed that in two steps.&lt;/p&gt;

&lt;p&gt;First, scoring became a structured policy.&lt;br&gt;
Second, admission became &lt;strong&gt;fail-closed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That second change matters more than it first appears.&lt;/p&gt;

&lt;p&gt;A weighted score answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which item ranks higher?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A fail-closed gate answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Should this item be considered at all?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the question &lt;a href="https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626"&gt;Part 2&lt;/a&gt; identified as missing from the usual output-reliability framing.&lt;/p&gt;

&lt;p&gt;To illustrate the pattern, here is a &lt;strong&gt;simplified educational example&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"scoring_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sim"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"trust"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"continuity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"gate_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"trust_floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"distilled"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"fail_behavior"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"exclude"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not the production specification. It is a teaching example.&lt;/p&gt;

&lt;p&gt;The point is simple: once admission becomes fail-closed, scoring stops being "rank everything and hope the top item is good enough." It becomes a governed decision boundary.&lt;/p&gt;
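&lt;p&gt;That decision boundary can be sketched in a few lines of Python. This is a hypothetical illustration of the fail-closed pattern, not the MICA implementation: the trust-class ordering and field names are assumed for the example.&lt;/p&gt;

```python
# Illustrative sketch of fail-closed admission.
# TRUST_ORDER and all field names are hypothetical, not the production spec.
TRUST_ORDER = ["raw", "distilled", "verified"]  # low -> high (example classes)

def admit(item, policy):
    """Apply the fail-closed gate first; rank only what survives it."""
    floor = policy["gate_policy"]["trust_floor"]
    if TRUST_ORDER.index(item["trust_class"]) < TRUST_ORDER.index(floor):
        return None  # fail-closed: excluded outright, score never computed
    weights = policy["weights"]
    return sum(weights[k] * item["scores"][k] for k in weights)

policy = {
    "weights": {"sim": 0.4, "trust": 0.3, "continuity": 0.1},
    "gate_policy": {"trust_floor": "distilled", "fail_behavior": "exclude"},
}
high_scoring_but_raw = {
    "trust_class": "raw",
    "scores": {"sim": 0.99, "trust": 0.9, "continuity": 0.9},
}
assert admit(high_scoring_but_raw, policy) is None  # gated out despite the score
```

&lt;p&gt;The design point is that the gate runs before the weighted sum: a high similarity score can never rescue an item that sits below the trust floor.&lt;/p&gt;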

&lt;p&gt;By v0.1.7, scoring had become a contract.&lt;/p&gt;

&lt;p&gt;But scoring was never the deepest structural gap.&lt;/p&gt;

&lt;p&gt;The deepest gap had always been invocation — how the archive reached the model at all.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The Observation That Changed the Question
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt28ur1wwmxucmve239g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkt28ur1wwmxucmve239g.png" alt="Invocation is not an installation problem" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before v0.1.8, invocation was framed as an installation problem.&lt;/p&gt;

&lt;p&gt;How do you make sure the model loads the archive before starting work?&lt;br&gt;
The assumed answers were plugins, tools, configuration, or external services.&lt;/p&gt;

&lt;p&gt;But that assumption missed something simpler.&lt;/p&gt;

&lt;p&gt;In many repository-based AI sessions, the README is already the model's first orientation point. Not because the model is explicitly configured to read it, but because the README is the natural entry surface for understanding a workspace.&lt;/p&gt;

&lt;p&gt;The invocation mechanism did not need to be invented.&lt;/p&gt;

&lt;p&gt;It needed to be declared.&lt;/p&gt;

&lt;p&gt;That insight changed the problem.&lt;/p&gt;

&lt;p&gt;If the README is already where an AI session begins, then the README can carry a structured session protocol:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;load the archive&lt;/li&gt;
&lt;li&gt;run session-start checks&lt;/li&gt;
&lt;li&gt;report readiness&lt;/li&gt;
&lt;li&gt;stop if critical governance conditions fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfmmu9wkv3ykdulmlnrw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqfmmu9wkv3ykdulmlnrw.png" alt="The natural entry surface becomes the protocol" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;simplified educational example&lt;/strong&gt; looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;## [AI Session Protocol]&lt;/span&gt;

Before starting any work:
&lt;span class="p"&gt;
1.&lt;/span&gt; Load &lt;span class="sb"&gt;`memory/example-service.mica.json`&lt;/span&gt;
&lt;span class="p"&gt;2.&lt;/span&gt; Check critical invariants
&lt;span class="p"&gt;3.&lt;/span&gt; Run session-start self-tests
&lt;span class="p"&gt;4.&lt;/span&gt; Report session readiness
&lt;span class="p"&gt;5.&lt;/span&gt; Stop if any blocking condition is present

Required opening report:
&lt;span class="p"&gt;-&lt;/span&gt; gate
&lt;span class="p"&gt;-&lt;/span&gt; active invariants
&lt;span class="p"&gt;-&lt;/span&gt; drift status
&lt;span class="p"&gt;-&lt;/span&gt; track
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key idea behind README-as-Protocol.&lt;/p&gt;

&lt;p&gt;The README is no longer just documentation.&lt;br&gt;
It becomes the session entry layer.&lt;/p&gt;

&lt;p&gt;README-as-Protocol is not a workaround.&lt;br&gt;
It is a recognition: the invocation mechanism already existed in practice. It simply had not yet been formalized.&lt;/p&gt;

&lt;p&gt;v0.1.8 turned that recognition into a schema-level contract.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. v0.1.8: The Schema Reaches the Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferxcr5b4ws8g6rmt8lcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ferxcr5b4ws8g6rmt8lcz.png" alt="Five pillars of a runtime governance contract" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this stage, the important change was not one field. It was the structural shift.&lt;/p&gt;

&lt;p&gt;The archive no longer just described governance.&lt;br&gt;
It declared how governance entered the session.&lt;/p&gt;

&lt;p&gt;In simplified terms, v0.1.8 added &lt;strong&gt;five kinds of structure&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the archive declares &lt;strong&gt;how it reaches the model&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the model must &lt;strong&gt;prove it loaded the archive&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;invariants become &lt;strong&gt;structured governance objects&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the archive declares &lt;strong&gt;how drift is handled&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;the archive can &lt;strong&gt;validate itself against the real project state&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The following simplified example illustrates &lt;strong&gt;three of those five additions&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"project"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"example-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"design_invariants"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DI-001"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Do not introduce undeclared dependencies."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"critical"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invocation_protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"primary_pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"readme_protocol"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"session_report_format"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"trigger"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"session_start"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required_fields"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"gate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"active_invariants"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"track"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example leaves out most of the real system. That is intentional.&lt;/p&gt;

&lt;p&gt;The purpose is to show the architectural shift:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;design_invariants&lt;/code&gt; says what must remain true&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;invocation_protocol&lt;/code&gt; says how the archive reaches the session&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;session_report_format&lt;/code&gt; says how the model proves it loaded the archive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; makes the invariant referenceable in audit output, and &lt;code&gt;severity&lt;/code&gt; determines how violations are handled.&lt;/p&gt;
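&lt;p&gt;A minimal sketch of that readiness check, assuming the simplified field names from the example above (not the production specification):&lt;/p&gt;

```python
# Hypothetical sketch: a session is ready only when its opening report
# covers every field the archive's session_report_format declares.
def session_ready(archive, report):
    """Return (ready, missing_fields) for a session opening report."""
    required = archive["session_report_format"]["required_fields"]
    missing = [field for field in required if field not in report]
    return (len(missing) == 0, missing)

archive = {
    "session_report_format": {
        "trigger": "session_start",
        "required_fields": ["gate", "active_invariants", "track"],
    }
}

ok, missing = session_ready(archive, {"gate": "pass", "track": "stable"})
# not ready: the report never proved which invariants are active
```

&lt;p&gt;Under this shape, "the model loaded the archive" stops being an assumption and becomes a checkable claim: either the report names the required fields or the session does not start.&lt;/p&gt;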

&lt;p&gt;Before v0.1.8, a session could begin with a perfectly governed archive existing somewhere in the repository — but with no requirement that the model actually load it.&lt;/p&gt;

&lt;p&gt;After v0.1.8, the archive no longer merely existed.&lt;/p&gt;

&lt;p&gt;It had a declared path into the session.&lt;/p&gt;

&lt;p&gt;And the session was not considered ready until it reported that the archive had been loaded.&lt;/p&gt;

&lt;p&gt;That is the moment where governance stopped being documentation and became runtime behavior.&lt;/p&gt;

&lt;p&gt;Files change. Hashes drift. Before this layer existed, that drift could remain silent. Now the schema requires a declared response when the archive no longer matches the project it describes.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. v0.1.8.1: Precision Patches
&lt;/h2&gt;

&lt;p&gt;Once the major shift was in place, several smaller ambiguities appeared in practice.&lt;/p&gt;

&lt;p&gt;These did not change the architecture.&lt;br&gt;
They clarified how the architecture behaved.&lt;/p&gt;

&lt;p&gt;In simplified terms, the patches did three things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;they made self-tests more explicitly machine-evaluable&lt;/li&gt;
&lt;li&gt;they declared which runtime was responsible for running those tests&lt;/li&gt;
&lt;li&gt;they clarified authority between invariant definitions and derived track views so drift could be detected instead of silently accumulating&lt;/li&gt;
&lt;/ul&gt;
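&lt;p&gt;A machine-evaluable self-test of that kind can be sketched as follows. The registry shape and field names are illustrative, not the v0.1.8.1 field layout:&lt;/p&gt;

```python
import hashlib
import os

# Sketch of a machine-evaluable self-test: verify each registered file
# still exists and still matches its recorded sha256 hash.
def run_self_tests(registry):
    """Return a list of (path, reason) failures; empty means no drift."""
    failures = []
    for entry in registry:
        path = entry["path"]
        if not os.path.exists(path):
            failures.append((path, "missing"))
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest != entry["sha256"]:
            failures.append((path, "hash_drift"))
    return failures
```

&lt;p&gt;The point of "machine-evaluable" is exactly this: no judgment call is involved. Either the hash matches or the archive no longer describes the project in front of the model.&lt;/p&gt;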

&lt;p&gt;These are not the kinds of changes that make headlines.&lt;/p&gt;

&lt;p&gt;They are the kinds of changes that make a governance system stable enough to survive repeated use.&lt;/p&gt;

&lt;p&gt;That is what patch releases are for.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. Playbook: What the Schema Cannot Enforce
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfhwbjc7u13rgk03xnii.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgfhwbjc7u13rgk03xnii.png" alt="Schema enforces structure" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, a second document became necessary: the Playbook.&lt;/p&gt;

&lt;p&gt;The schema can enforce structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;required fields&lt;/li&gt;
&lt;li&gt;valid shapes&lt;/li&gt;
&lt;li&gt;invocation declaration&lt;/li&gt;
&lt;li&gt;opening reports&lt;/li&gt;
&lt;li&gt;self-tests&lt;/li&gt;
&lt;li&gt;drift behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it cannot enforce operator judgment:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;when to use which invocation pattern&lt;/li&gt;
&lt;li&gt;how to update invariants safely&lt;/li&gt;
&lt;li&gt;what counts as a complete archive&lt;/li&gt;
&lt;li&gt;which shortcuts are acceptable&lt;/li&gt;
&lt;li&gt;how to avoid human-caused drift&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the division became explicit.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The schema enforces structure.&lt;br&gt;
The Playbook enforces judgment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is not a limitation. It is a design boundary.&lt;/p&gt;

&lt;p&gt;A governance system that tries to encode all human discipline into the schema becomes brittle and unmaintainable. Some rules belong in machine-validated structure. Others belong in operator practice.&lt;/p&gt;

&lt;p&gt;A simplified educational example makes the difference concrete.&lt;/p&gt;

&lt;p&gt;The schema requires that invocation be declared:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"invocation_protocol"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"primary_pattern"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"readme_protocol"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Playbook explains how to use it correctly:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use README-as-Protocol as the default starting mode.&lt;/li&gt;
&lt;li&gt;Do not submit a partially completed archive to the validator as if it were final.&lt;/li&gt;
&lt;li&gt;Treat invariant track assignment as the single source of truth.&lt;/li&gt;
&lt;li&gt;Validate in this order: schema, invocation completeness, self-test coverage, provenance completeness, then track consistency.&lt;/li&gt;
&lt;/ul&gt;
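&lt;p&gt;The validation order above can be sketched as a stop-at-first-failure pipeline. The stage names follow the list; the check functions are hypothetical stand-ins for the real validators:&lt;/p&gt;

```python
# Illustrative sketch of the Playbook's validation order.
# Each check is a hypothetical stand-in; only the ordering is the point.
def validate(archive, checks):
    """Run checks in declared order; stop and report the first failure."""
    for name, check in checks:
        if not check(archive):
            return (False, name)
    return (True, None)

checks = [
    ("schema", lambda a: "project" in a),
    ("invocation_completeness", lambda a: "invocation_protocol" in a),
    ("self_test_coverage", lambda a: "self_test_policy" in a),
    ("provenance_completeness", lambda a: "provenance_registry" in a),
    ("track_consistency", lambda a: True),  # placeholder for the real check
]

ok, failed_stage = validate({"project": {}}, checks)
# fails at "invocation_completeness": later stages never run
```

&lt;p&gt;Ordering matters because a later check is meaningless when an earlier one fails: there is no point checking track consistency in an archive whose invocation is not even declared.&lt;/p&gt;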

&lt;p&gt;The schema can require that invocation exists.&lt;br&gt;
It cannot require that the operator uses it well.&lt;/p&gt;

&lt;p&gt;README-as-Protocol works for the widest range of projects without requiring anything beyond a text file. That makes it the recommended default. But it is not the only option. Heavier, installation-based or infrastructure-dependent approaches may fit better in some environments. The Playbook documents when the default is no longer enough — and what to do next.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. The Distance from v0.0.1 to an Operative Schema
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9tzy2kq7pxude63q63s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9tzy2kq7pxude63q63s.png" alt="The evolution of operative governance" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626"&gt;Part 2&lt;/a&gt; named three failures in v0.0.1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 1: Scoring with no defined semantics.&lt;/strong&gt;&lt;br&gt;
The earliest version used hardcoded values with no clear combination rule or enforced model.&lt;br&gt;
→ Closed in &lt;strong&gt;v0.1.0&lt;/strong&gt; (implementable formula) and &lt;strong&gt;v0.1.7&lt;/strong&gt; (structured scoring plus fail-closed admission).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 2: Invariants encoded as comments.&lt;/strong&gt;&lt;br&gt;
The earliest version used plain strings where governance objects were needed.&lt;br&gt;
→ Partially closed in &lt;strong&gt;v0.1.5&lt;/strong&gt;. Fully closed in &lt;strong&gt;v0.1.8&lt;/strong&gt; when invariants became structured, referenceable, severity-bearing entries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Failure 3: No path to the model.&lt;/strong&gt;&lt;br&gt;
The archive existed, but the model had no reliable way to know it existed.&lt;br&gt;
→ Closed in &lt;strong&gt;v0.1.8&lt;/strong&gt; when invocation became a declared protocol and session start required proof of archive load.&lt;/p&gt;

&lt;p&gt;That progression can be summarized simply:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The schema existed.&lt;br&gt;
Then governance moved into the schema.&lt;br&gt;
Then the schema reached the model.&lt;br&gt;
Only then did governance become operative.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the real distance from v0.0.1 to this point.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Where This Leaves MICA
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dsx18dep0v2to8bh2kb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4dsx18dep0v2to8bh2kb.png" alt="MICA is no longer just a memory format" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this stage, MICA is no longer just a memory format.&lt;/p&gt;

&lt;p&gt;It is no longer just a structured archive.&lt;/p&gt;

&lt;p&gt;It is a runtime governance contract:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what context is allowed in&lt;/li&gt;
&lt;li&gt;what must remain true&lt;/li&gt;
&lt;li&gt;how the archive is loaded&lt;/li&gt;
&lt;li&gt;how drift is handled&lt;/li&gt;
&lt;li&gt;how readiness is reported before work begins&lt;/li&gt;
&lt;/ul&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;That is a different category of system.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  9. What Comes Next
&lt;/h2&gt;

&lt;p&gt;The next part of this series steps back from version history and asks a broader question:&lt;/p&gt;

&lt;p&gt;Where does MICA sit in the context engineering landscape that already existed around it?&lt;/p&gt;

&lt;p&gt;The answer is narrower than most people expect, and more specific than most framings allow.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; An AI session already uses the README as an orientation surface. Formalizing that behavior as a schema-level invocation protocol is not a trick. It is the recognition that the mechanism already existed — and needed to be declared, not invented.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>The Stake Was Governance Outside the Schema. MICA v0.1.5 Pulled It In</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Sat, 21 Mar 2026 17:09:51 +0000</pubDate>
      <link>https://dev.to/flamehaven01/the-stake-was-governance-outside-the-schema-mica-v015-pulled-it-in-46n9</link>
      <guid>https://dev.to/flamehaven01/the-stake-was-governance-outside-the-schema-mica-v015-pulled-it-in-46n9</guid>
      <description>&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Provenance Registry&lt;/strong&gt;: A structured, hash-anchored record of where each context item came from. Requires &lt;code&gt;uri&lt;/code&gt;, &lt;code&gt;sha256&lt;/code&gt;, &lt;code&gt;kind&lt;/code&gt;, &lt;code&gt;created_at&lt;/code&gt;, and &lt;code&gt;trust_class&lt;/code&gt;. Formalized as a required field in v0.1.5.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Deviation Log&lt;/strong&gt;: An auditable record of every change to a governed archive. Each entry requires &lt;code&gt;before_hash&lt;/code&gt;, &lt;code&gt;after_hash&lt;/code&gt;, &lt;code&gt;gate&lt;/code&gt;, &lt;code&gt;approved_by&lt;/code&gt;, and &lt;code&gt;rollback_ready&lt;/code&gt;. Formalized in v0.1.5.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Semantic Validation Policy&lt;/strong&gt;: A machine-evaluable rule set applied to context items. Each rule requires &lt;code&gt;id&lt;/code&gt;, &lt;code&gt;expression&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, and &lt;code&gt;on_fail&lt;/code&gt;. Replaces string invariants in v0.1.5.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Semantic Collapse&lt;/strong&gt;: The pattern in which a JSON Schema specification is applied to an LLM as a runtime contract rather than as a validator. The LLM executes the schema semantically and produces a contract-compliant artifact. First demonstrated in v0.1.4.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation&lt;/strong&gt;: The mechanism by which a MICA archive reaches an AI session. Still absent at v0.1.5. Formalized as a required field in v0.1.8.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hhu3bwt8gsjpkf5zpu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hhu3bwt8gsjpkf5zpu0.png" alt="Cover image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Where Part 2 Left Off
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626"&gt;Part 2&lt;/a&gt; documented three failures in v0.0.1: scoring with no defined semantics, invariants encoded as comments rather than constraints, and no path for the archive to reach the model at all.&lt;/p&gt;

&lt;p&gt;The first two were fixable by adding structure. The third was different in kind — a schema without an enforcement path is not a governance schema at all.&lt;/p&gt;

&lt;p&gt;This part covers v0.1.0 through v0.1.5. What the intermediate versions fixed. Which experiment revealed the need for v0.1.5. And what was still missing at the end of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Stake Nobody Pulled
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7bt0hzdjnpo6vun816e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl7bt0hzdjnpo6vun816e.png" alt="We assumed governance lived outside the schema" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is an old story about elephants. A young elephant chained to a small stake learns it cannot move. Years later, fully grown, it still does not pull — not because the stake holds, but because the boundary became an unquestioned assumption.&lt;/p&gt;

&lt;p&gt;In practice, what I kept seeing across LLM tooling discussions was exactly that. Technical provenance tracked inside the system, governance detail left in documents. Who approved a change, under what conditions, with what rationale — always a README, a review checklist, a comment. Never the schema. Semantic rules written as strings, described in documentation, reviewed by humans. Everyone knew the constraint. Nobody encoded it.&lt;/p&gt;

&lt;p&gt;The stake was real. It was not immovable.&lt;/p&gt;

&lt;p&gt;v0.1.0 through v0.1.4 did not question it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The stake was a single assumption: governance lived outside the schema.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. v0.1.0 to v0.1.4: More Implementable, Not Yet Inspectable
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofik5a6q082zkr5wethk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fofik5a6q082zkr5wethk.png" alt="Implementable, but not inspectable" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tvshg1d0uukl6dcy7k9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7tvshg1d0uukl6dcy7k9.png" alt="Implementable, but not inspectable" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These versions addressed the first two failures from v0.0.1. In brief:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt;: Scoring became implementable — hardcoded hints replaced with an explicit weighted formula: &lt;code&gt;clamp01(0.55 * sim + 0.15 * recency + 0.10 * invoke + 0.10 * trust + 0.10 * continuity)&lt;/code&gt;. Two implementations now produce the same number for the same inputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt;: &lt;code&gt;invoke_role&lt;/code&gt; semantics defined — each role got explicit score bonus, pin behavior, compile behavior, and eviction priority.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt;: Eviction became a five-phase strategy — budget overflow had a defined sequence and a failure condition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt;: Error handling defined — eight failure modes, each with an explicit action and severity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.4&lt;/strong&gt;: Description updated to include &lt;code&gt;"semantic enforcement"&lt;/code&gt; — required field count unchanged. The label was ahead of the implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These versions made the schema reproducible to implement. They did not make governance inspectable. &lt;code&gt;design_invariants&lt;/code&gt; remained a plain string array. &lt;code&gt;scoring_policy.function&lt;/code&gt; remained a free string in the JSON Schema. &lt;code&gt;provenance_registry&lt;/code&gt; did not exist. The assumption from Section 2 was still in place.&lt;/p&gt;
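&lt;p&gt;For reference, the v0.1.0 formula reduces to a few lines. This is a sketch, not the reference implementation; the function and argument names are illustrative:&lt;/p&gt;

```python
def clamp01(x):
    # Clamp a raw score into the [0, 1] range.
    return max(0.0, min(1.0, x))

def score(sim, recency, invoke, trust, continuity):
    # The weighted formula specified in v0.1.0. Two implementations
    # now produce the same number for the same inputs.
    return clamp01(
        0.55 * sim
        + 0.15 * recency
        + 0.10 * invoke
        + 0.10 * trust
        + 0.10 * continuity
    )
```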




&lt;h2&gt;
  
  
  4. The Experiment That Revealed the Gap
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcawxyh7f9lqf9tqxtknz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcawxyh7f9lqf9tqxtknz.png" alt="The schema was executable, but..." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In March 2026, a semantic collapse experiment was run against v0.1.4. The procedure was direct: provide the schema and session context to an LLM, instruct it to execute the schema semantically, and verify the output against all required fields.&lt;/p&gt;

&lt;p&gt;The result was a fully contract-compliant artifact. Zero field violations. Zero type constraint violations. The scoring pipeline ran. The eviction log was correct. The audit trail was structurally preserved. The schema worked — as a semantic contract against an LLM runtime.&lt;/p&gt;

&lt;p&gt;But the artifact also contained this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"provenance"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"uri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mica://memory/flamehaven-labs/project/identity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_kind"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"memory_export"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"trust_class"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"canonical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"collected_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"UNKNOWN"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:project_identity_anchor"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;"collected_at": "UNKNOWN"&lt;/code&gt;. Every memory-tier item in the artifact carried this. The timestamp was not available. The schema had no mechanism to require it. The provenance claim could not be verified — not because the data was missing, but because the schema had no structure to anchor it.&lt;/p&gt;

&lt;p&gt;The experiment proved the schema was executable. It also proved that a contract-compliant archive could contain provenance that was structurally present but epistemically empty.&lt;/p&gt;

&lt;p&gt;The same experiment revealed two more gaps. Design invariants were strings — the LLM read them, interpreted them, and applied them as best it could. But there was no &lt;code&gt;id&lt;/code&gt; to reference in audit output, no &lt;code&gt;severity&lt;/code&gt; to determine handling, no &lt;code&gt;on_fail&lt;/code&gt; to define consequence. And every change to the archive between sessions had no record. No &lt;code&gt;before_hash&lt;/code&gt;. No &lt;code&gt;approved_by&lt;/code&gt;. No &lt;code&gt;rollback_ready&lt;/code&gt;. The archive changed silently.&lt;/p&gt;

&lt;p&gt;The semantic collapse experiment proved the schema was implementable. It also produced a precise map of what governance was still missing.&lt;/p&gt;

&lt;p&gt;That map became v0.1.5.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. The Question v0.1.5 Asked
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdbmay9lr7pcdowskogx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgdbmay9lr7pcdowskogx.png" alt="what if governance belongs inside the schema?" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v0.1.5 did not ask "what else should we add?"&lt;/p&gt;

&lt;p&gt;It asked: &lt;strong&gt;what if governance itself belongs inside the schema?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not as a description of what enforcement should look like. As a machine-evaluable structure that enforces it.&lt;/p&gt;

&lt;p&gt;MICA v0.1.0 through v0.1.4 had followed the same pattern common in practice — governance detail kept implicit, schema focused on structure and scoring. The experiment made the cost of that pattern visible. v0.1.5 pulled three assumptions out of documentation and forced them into machine-readable structure.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. What v0.1.5 Pulled In
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6x110hgaba4ng800pxw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6x110hgaba4ng800pxw.png" alt="From implicit documentation to machine-evaluable structure" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stake 1: Provenance was a field. v0.1.5 made it a registry.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The experiment artifact had &lt;code&gt;"collected_at": "UNKNOWN"&lt;/code&gt; on every memory item. Before v0.1.5, there was no mechanism to prevent that — or detect it.&lt;/p&gt;

&lt;p&gt;v0.1.5 introduced &lt;code&gt;provenance_registry&lt;/code&gt; as a required top-level structure. Each record:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"uri"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"kind"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"created_at"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"trust_class"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;sha256&lt;/code&gt; anchored the source to a specific state. &lt;code&gt;trust_class&lt;/code&gt; declared reliability at registration — not inferred at use. &lt;code&gt;created_at&lt;/code&gt; was now required — not optional, not &lt;code&gt;"UNKNOWN"&lt;/code&gt;. A context item's provenance claim could now be checked against the registry. If the hash did not match, the claim was false.&lt;/p&gt;

&lt;p&gt;Provenance had been a field on a document. v0.1.5 made it a verifiable record.&lt;/p&gt;
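&lt;p&gt;A minimal sketch of the check the registry enables. Only the five required keys come from the spec; the record layout and the function name are illustrative:&lt;/p&gt;

```python
import hashlib

REQUIRED_KEYS = {"uri", "sha256", "kind", "created_at", "trust_class"}

def verify_provenance(item_content, record):
    # Reject records missing any of the five required fields.
    if not REQUIRED_KEYS.issubset(record):
        return False
    # "UNKNOWN" is not a timestamp; v0.1.5 requires a real created_at.
    if record["created_at"] == "UNKNOWN":
        return False
    # The hash anchors the source to a specific state.
    # If it does not match, the provenance claim is false.
    digest = hashlib.sha256(item_content.encode("utf-8")).hexdigest()
    return record["sha256"] == digest
```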

&lt;p&gt;&lt;strong&gt;Stake 2: Changes happened. v0.1.5 made them auditable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Between sessions, the archive changed. There was no record of what changed, when, under whose authority, or whether it could be undone.&lt;/p&gt;

&lt;p&gt;v0.1.5 introduced &lt;code&gt;deviation_log&lt;/code&gt; as a required field. Each entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"change_id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timestamp_utc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="s2"&gt;"before_hash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"after_hash"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"approved_by"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rollback_ready"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;before_hash&lt;/code&gt; and &lt;code&gt;after_hash&lt;/code&gt; meant every change left a cryptographic record. &lt;code&gt;gate&lt;/code&gt; associated the change with a defined governance checkpoint. &lt;code&gt;approved_by&lt;/code&gt; declared authority explicitly. &lt;code&gt;rollback_ready&lt;/code&gt; required the system to declare whether it could undo the change before committing it.&lt;/p&gt;

&lt;p&gt;Change history had been implicit. v0.1.5 made it a required audit trail.&lt;/p&gt;
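&lt;p&gt;A sketch of what producing a conforming entry looks like. The eight required keys come from the spec; the canonical-JSON hashing, the fixed change id, and the helper names are assumptions for illustration:&lt;/p&gt;

```python
import hashlib
import json
from datetime import datetime, timezone

def archive_hash(archive):
    # Hash a canonical JSON serialization of the archive state.
    blob = json.dumps(archive, sort_keys=True).encode("utf-8")
    return "sha256:" + hashlib.sha256(blob).hexdigest()

def log_change(archive, apply_change, gate, reason, approved_by, rollback_ready):
    # Record the state before and after the change, as v0.1.5 requires.
    before = archive_hash(archive)
    apply_change(archive)
    return {
        "change_id": "chg-0001",  # illustrative identifier
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "gate": gate,
        "reason": reason,
        "before_hash": before,
        "after_hash": archive_hash(archive),
        "approved_by": approved_by,
        "rollback_ready": rollback_ready,
    }
```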

&lt;p&gt;&lt;strong&gt;Stake 3: Invariants were strings. v0.1.5 made them rules.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The experiment applied string invariants only as faithfully as the model could interpret them. There was no &lt;code&gt;id&lt;/code&gt; to reference in audit output, no &lt;code&gt;severity&lt;/code&gt; to triage violations, no &lt;code&gt;on_fail&lt;/code&gt; to define consequence.&lt;/p&gt;

&lt;p&gt;v0.1.5 introduced &lt;code&gt;semantic_validation_policy&lt;/code&gt; with structured rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"severity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"on_fail"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each rule had an &lt;code&gt;id&lt;/code&gt; — referenceable in audit output. Each had a &lt;code&gt;severity&lt;/code&gt; — determining how a violation was handled. Each had an &lt;code&gt;on_fail&lt;/code&gt; — defining the consequence. A violation could now be detected, logged, and traced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82dirh52ir3m7sgvkbld.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F82dirh52ir3m7sgvkbld.png" alt="invariants are no longer strings." width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;String invariants described what the system must not do. Semantic rules could detect when it did.&lt;/p&gt;
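&lt;p&gt;A sketch of rule evaluation, using a Python callable as a stand-in for the spec's expression language. The rule itself is invented for illustration; only the four required keys come from the spec:&lt;/p&gt;

```python
RULES = [
    {
        "id": "INV-001",  # illustrative rule
        "expression": lambda item: item["trust_class"] != "unverified",
        "severity": "error",
        "on_fail": "reject_item",
    },
]

def evaluate(item, rules):
    # A violation is detected, logged, and traceable to a rule id.
    violations = []
    for rule in rules:
        if not rule["expression"](item):
            violations.append({
                "rule_id": rule["id"],
                "severity": rule["severity"],
                "action": rule["on_fail"],
            })
    return violations
```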




&lt;h2&gt;
  
  
  7. What v0.1.5 Still Did Not Have
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw488sucj15flz6mh03qk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw488sucj15flz6mh03qk.png" alt="The governance was visible but not yet operative" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two things remained missing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scoring was still a free string at the schema level.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;scoring_policy.function&lt;/code&gt; was still &lt;code&gt;{ "type": "string" }&lt;/code&gt; in the JSON Schema. The weighted formula from v0.1.0 was a specified convention — implementors who followed it produced the same number. But the schema could not enforce that they did. A conforming archive could contain any string in that field. The formula was real. The contract was not.&lt;/p&gt;
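&lt;p&gt;The gap is easy to demonstrate. A type-only constraint admits any string, including one naming a different formula entirely; the pinned-value tightening sketched here is hypothetical, not part of v0.1.5:&lt;/p&gt;

```python
SPEC_FORMULA = (
    "clamp01(0.55*sim + 0.15*recency + 0.10*invoke "
    "+ 0.10*trust + 0.10*continuity)"
)

def validates_as_string(value):
    # All that { "type": "string" } can check.
    return isinstance(value, str)

def validates_as_contract(value):
    # Hypothetical tightening: pin the field to the specified formula.
    return value == SPEC_FORMULA

# A conforming v0.1.5 archive could carry any string in this field:
assert validates_as_string("score = random()")        # passes the schema
assert not validates_as_contract("score = random()")  # fails the contract
```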

&lt;p&gt;&lt;strong&gt;The archive still had no path to the model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;invocation_protocol&lt;/code&gt; did not exist in v0.1.5. The provenance registry was hash-anchored. The deviation log was auditable. The semantic rules were machine-evaluable. A session could start with a perfectly governed archive — verified, logged, ruled — and the model would begin work with no instruction to locate it.&lt;/p&gt;

&lt;p&gt;The governance was visible. It was not yet operative.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What This Series Covers Next
&lt;/h2&gt;

&lt;p&gt;Part 1 defined the problem.&lt;/p&gt;

&lt;p&gt;Part 2 documented v0.0.1 — what it was and why its most fundamental failure took the longest to fix.&lt;/p&gt;

&lt;p&gt;This part covered v0.1.0 through v0.1.5: how the schema became implementable, then auditable, and what it still could not do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 4&lt;/strong&gt; covers v0.1.7 and v0.1.8: the fail-closed gate, the provenance enforcement boundary, and the version that finally made invocation a required field — the stake that had been in the ground since v0.0.1.&lt;/p&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtpq57rzu101i4pp5atk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwtpq57rzu101i4pp5atk.png" alt="final thought" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>architecture</category>
      <category>contextengineering</category>
    </item>
    <item>
      <title>Medical AI Repositories Need More Than Benchmarks. We Built STEM-AI to Audit Trust</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Fri, 20 Mar 2026 17:30:52 +0000</pubDate>
      <link>https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f</link>
      <guid>https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn54npnrc11jo9gwvkwhk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn54npnrc11jo9gwvkwhk.png" alt="surface polish does not equal clinical safety" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you have been paying attention to GitHub recently — the past six months — you have seen the pattern.&lt;/p&gt;

&lt;p&gt;A new bio-AI repository appears. It promises to automate genomic analysis, drug discovery, medical imaging, or clinical data interpretation. The README is polished. The architecture diagram looks serious. Within weeks it has hundreds of stars, a few forks, and a preprint on bioRxiv.&lt;/p&gt;

&lt;p&gt;Then nothing.&lt;/p&gt;

&lt;p&gt;No CI. No CHANGELOG. No response to issues. No clear statement of limitations. No clinical disclaimer anywhere in the repository. And a tool that may now be touching patient-adjacent workflows has exactly one quality gate: the author thought it was ready.&lt;/p&gt;

&lt;p&gt;We are developers. We have built things in that atmosphere too. At some point, we had to ask a harder question than benchmark accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When a bio-AI repository gets close to real diagnostic, genomic, imaging, or therapeutic workflows — what does "trustworthy enough for serious review" actually mean?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The field has benchmarks for models. It has almost no shared standards for repository accountability.&lt;/p&gt;

&lt;p&gt;So we built one.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Makes This Moment Different
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic1knigacj3qxkxmqppt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fic1knigacj3qxkxmqppt.png" alt="The cost of being wrong is fundamentally different here" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bio-AI is now filling up with skill libraries, agent wrappers, orchestration pipelines, and plugin-style marketplaces that look far more deployable than they actually are. Surface maturity is easy to fake. A clean README, a marketplace entry, or a rising GitHub star count can make a repository look trustworthy long before it has earned that trust.&lt;/p&gt;

&lt;p&gt;And this category cannot be judged like ordinary software.&lt;/p&gt;

&lt;p&gt;If a note-taking app breaks, users get frustrated.&lt;br&gt;
If an internal dashboard fails, a team loses time.&lt;br&gt;
But when a biomedical or medical-AI repository fails quietly, the consequences do not stop at software quality.&lt;/p&gt;

&lt;p&gt;A flawed genomics pipeline can distort interpretation.&lt;br&gt;
A weak clinical model can normalize unsafe confidence.&lt;br&gt;
A drug-discovery system can push the wrong candidates forward, bury better ones, and send time, capital, and downstream validation effort in the wrong direction.&lt;/p&gt;

&lt;p&gt;In this category, failure is not just a debugging problem.&lt;br&gt;
It is a patient-safety problem and a resource-allocation problem.&lt;/p&gt;

&lt;p&gt;That is why these systems require more than code inspection. They require scrutiny of documentation, limits, provenance, maintenance behavior, and public claims — because the cost of being wrong is fundamentally different here.&lt;/p&gt;


&lt;h2&gt;
  
  
  What STEM-AI Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma18rmc5secs0oebheri.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fma18rmc5secs0oebheri.png" alt="what STEM-AI is" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;STEM-AI (Sovereign Trust Evaluator for Medical AI)&lt;/strong&gt; is a governance audit framework for public bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;It does &lt;strong&gt;not&lt;/strong&gt; ask whether a project sounds impressive.&lt;/p&gt;

&lt;p&gt;It asks whether the repository shows observable signs of responsible engineering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;honest documentation&lt;/li&gt;
&lt;li&gt;consistent public claims&lt;/li&gt;
&lt;li&gt;maintenance discipline&lt;/li&gt;
&lt;li&gt;biological data responsibility&lt;/li&gt;
&lt;li&gt;explicit acknowledgement of limits&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That distinction matters. A repository can look technically sophisticated and still fail the most basic governance test for patient-adjacent use. A polished README is not a safety surface by itself.&lt;/p&gt;

&lt;p&gt;STEM-AI is &lt;strong&gt;not&lt;/strong&gt; a regulatory verdict, clinical certification, or legal assessment. It is a structured review framework designed to give researchers, reviewers, procurement teams, and engineers a more reproducible starting point for discussion. The canonical spec requires a non-waivable disclaimer in every output stating exactly that.&lt;/p&gt;

&lt;p&gt;STEM-AI is meant to support serious review, not replace it.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why We Made It LLM-Native
&lt;/h2&gt;

&lt;p&gt;STEM-AI runs as a structured specification executed by a major LLM.&lt;/p&gt;

&lt;p&gt;The spec is the program. The LLM is the runtime.&lt;/p&gt;

&lt;p&gt;That sounds strange until you look at the design constraint. We did not want a system that "vibes" its way to a trust score. We wanted a system that forces checklist-based scoring, explicit evidence chains, N/A handling for missing data, and hard floors for catastrophic claims. The goal is to reduce evaluator drift by replacing narrative judgment with traceable rubric logic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model is not supposed to "feel" trust.&lt;br&gt;
It is supposed to count evidence.&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycgx2uf8b7vdevu2tvxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycgx2uf8b7vdevu2tvxv.png" alt="how it works" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM-AI evaluates a repository across three stages. Each stage has a defined checklist. Each score must map back to observable evidence.&lt;/p&gt;

&lt;p&gt;These three stages ask three different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what the repository says&lt;/li&gt;
&lt;li&gt;what public communication says&lt;/li&gt;
&lt;li&gt;what the codebase actually proves&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Stage 1 — README Intent
&lt;/h3&gt;

&lt;p&gt;If a repository is patient-adjacent, the README is not marketing. It is the first governance surface.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WHAT WE LOOK FOR (positive signals):

  [R1] Does a Limitations or Known Issues section exist
       and cover a substantial portion of the README
       with specific, actionable content?

  [R2] Is any regulatory framework cited?
       (FDA SaMD guidelines, CE marking, IRB requirements,
        or equivalent)

  [R3] Is there a clinical disclaimer?
       ("This tool is NOT a substitute for clinical judgment"
        or equivalent)

  [R4] Are demographic bias limits or applicable population
       boundaries disclosed?

  [R5] Are reproducibility provisions present?
       (environment pinning, data version, seed values)

WHAT WE PENALIZE (negative signals):

  [H1] Performance superlatives without benchmark comparison
       "SOTA", "State-of-the-Art", "Best-in-class"

  [H2] Unsubstantiated innovation claims
       "Revolutionary", "Groundbreaking", "Game-changing"

  [H3] Fully autonomous framing in a clinical context
       "Fully Automated", "Zero Human Oversight"
       without any stated supervision requirement

  [H4] AGI or human-level capability claims
       "AGI", "Human-level", "Surpasses clinicians"

  [H5] Social proof substituted for technical credibility
       GitHub stars or download counts cited as a trust signal

  [H6] External optics substituted for technical evidence
       VC funding or press coverage without validation

CRITICAL RULE:
  IF H3 AND H4 are both triggered simultaneously:
    → Hard floor activated.
    → Final Score = 0. Tier = T0. No further scoring.
    → A tool claiming fully automated clinical operation
      at AGI level is a patient safety issue,
      not a scoring edge case.
      No positive signals elsewhere can override this.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
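&lt;p&gt;The hard floor can be sketched as a short guard. Function and variable names here are illustrative, not the project's actual API:&lt;/p&gt;

```python
# Illustrative sketch of the Stage 1 hard floor. Flags are modeled as a set
# of string codes ("H1".."H6"); none of these names are the real STEM-AI API.

def apply_hard_floor(score: float, flags: set) -> tuple:
    """Return (final_score, tier). H3 and H4 together force T0, full stop."""
    if "H3" in flags and "H4" in flags:
        # Fully automated clinical operation claimed at AGI level:
        # a patient safety issue, so no positive signal can override it.
        return 0.0, "T0"
    return score, None  # normal path: tier is assigned downstream

apply_hard_floor(85.0, {"H1", "H3", "H4"})  # floored to (0.0, "T0")
```

&lt;p&gt;The point of encoding it this way is that the override happens before any weighting: no accumulation of positive signals ever reaches a score that could outvote the floor.&lt;/p&gt;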



&lt;p&gt;&lt;strong&gt;On clinical adjacency:&lt;/strong&gt; STEM-AI does not assume all biomedical repositories carry the same deployment risk. The disclosure bar rises as a tool moves closer to patient-facing or decision-shaping use. If the repository contains medical imaging frameworks, drug docking engines, diagnostic genomics pipelines, or clinical language models, the absence of R2 and R3 is not a missed bonus — it becomes an active penalty.&lt;/p&gt;




&lt;h3&gt;
  
  
  Stage 2 — Cross-Platform Consistency
&lt;/h3&gt;

&lt;p&gt;This stage is less about marketing tone and more about contradiction. If a README warns carefully but public posts erase those warnings, the repository's governance surface is inconsistent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WHAT WE PENALIZE:

  [F1] Stars or downloads used as the primary message
       on external platforms, without technical substance

  [F2] Public dismissal or hostile response to
       critical feedback from the community

  [F3] Vanity metrics presented as clinical trust proxies

  [F4] External posts that directly contradict or omit
       warnings stated in the README
       → Most serious flag in Stage 2.
         A README that says "not for clinical use" while
         the author's LinkedIn announces clinical deployment
         is a consistency failure.

WHAT WE REWARD:

  [A1] Model errors or hallucination cases
       openly shared in public forums

  [A2] Reproducibility failures or known failure modes
       transparently acknowledged outside the repository

  [A3] Critical external feedback accepted and incorporated

  [A4] Regulatory or ethics body collaboration referenced
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This stage is only as strong as the public evidence available. When live cross-platform data is unavailable, Stage 2 goes N/A and its weight moves to Stages 1 and 3. In medical-adjacent evaluation, pretending to know is worse than stating you do not.&lt;/p&gt;
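&lt;p&gt;That redistribution is mechanically simple. A minimal sketch, assuming illustrative stage weights (the actual STEM-AI weights are not stated here):&lt;/p&gt;

```python
# Sketch of N/A weight redistribution. The baseline weights are assumptions,
# not the published STEM-AI weights.

def weighted_final(s1, s2, s3, weights=(0.35, 0.25, 0.40)):
    """Combine stage scores (0-100, or None for N/A) into a final score.

    An N/A stage contributes nothing; its weight is redistributed
    proportionally across the remaining stages by renormalizing.
    """
    live = [(s, w) for s, w in zip((s1, s2, s3), weights) if s is not None]
    return sum(s * w for s, w in live) / sum(w for _, w in live)

weighted_final(47, None, 10)  # Stage 2 N/A: its weight flows to Stages 1 and 3
```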




&lt;h3&gt;
  
  
  Stage 3 — Code Infrastructure and Biological Integrity
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtozugmzpzfaxtbacl8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frtozugmzpzfaxtbacl8m.png" alt="verifying the proof behind the polish" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is where the framework becomes more than a documentation audit.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TECHNICAL RESPONSIBILITY:

  [T1] CI/CD pipeline
       Does automated testing run on push or PR?
       Is coverage stated or just implied?

  [T2] Domain-specific regression tests
       Not just unit tests — tests that verify biological
       or clinical correctness invariants.

  [T3] CHANGELOG transparency
       Does one exist?
       Does it document bugs and failures honestly,
       or only feature additions?
       Silent updates on a clinical-adjacent tool are
       a different risk profile than an honest bug log.

  [T4] Issue and PR response pattern
       When did the last maintainer response happen?
       What proportion of issues receive a response?

BIOLOGICAL INTEGRITY:

  [B1] Data provenance and consent
       Where did the training or reference data come from?
       Is there an IRB approval or data consent declaration?

  [B2] Algorithmic bias disclosure
       Are performance limits across demographic subgroups
       disclosed? Are actual measurements provided,
       or only a general acknowledgment?

  [B3] Conflict of interest transparency
       Is the funding source disclosed?
       If there is commercial interest, is it stated?

TRAJECTORY SIGNAL:
  v1.0.4 adds a trajectory modifier comparing issue close
  rates and release frequency across two consecutive
  90-day windows.

  This signal is deliberately bounded: it changes Stage 3
  by at most ±5 points, which translates to roughly ±2
  points on the final weighted score.
  Large enough to matter near tier boundaries.
  Small enough not to distort the audit.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
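&lt;p&gt;A sketch of that bounding, assuming a crude activity proxy; the text above fixes only the ±5 cap, not the formula:&lt;/p&gt;

```python
# Sketch of the bounded trajectory modifier: compare two consecutive 90-day
# windows and clamp the Stage 3 adjustment to +/-5 points. The activity
# proxy and scaling factor are assumptions; only the cap comes from the spec.

def trajectory_modifier(prev, curr, cap=5.0):
    """Return a Stage 3 adjustment clamped to [-cap, +cap]."""
    def activity(w):
        close_rate = w["issues_closed"] / max(w["issues_opened"], 1)
        return close_rate + w["releases"]  # crude proxy (assumption)
    delta = activity(curr) - activity(prev)
    return max(-cap, min(cap, delta * 5.0))
```

&lt;p&gt;Because the adjustment saturates at the cap, even a repository that suddenly closes every open issue cannot buy more than a small, bounded bump near a tier boundary.&lt;/p&gt;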






&lt;h2&gt;
  
  
  What the Output Looks Like
&lt;/h2&gt;

&lt;p&gt;Every STEM-AI audit produces a structured report. Here is an illustrative example — repository name withheld, structure unchanged:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gh"&gt;# 🩺 STEM-AI Audit Report v1.0.4&lt;/span&gt;
──────────────────────────────────────────────
Target:        [Repository Name]
Audit Date:    2026-03-20
Report Expiry: 2026-09-16
Flags:         CLINICAL_ADJACENT: true
               NASCENT_REPO: false
               T0_HARD_FLOOR: false
──────────────────────────────────────────────
Stage 1 — README Intent        47 / 100  VC Pitch
Stage 2 — Cross-Platform       N/A       (MANUAL mode)
Stage 3 — Code Infrastructure  10 / 100  Hit-and-Run
──────────────────────────────────────────────
Final Score:   28 / 100
Tier:          T0 REJECTED
USE_SCOPE:     None — clinical use prohibited
──────────────────────────────────────────────
Priority Remediation:
&lt;span class="p"&gt;  1.&lt;/span&gt; Add clinical disclaimer (active penalty applied —
     clinical-adjacent tooling without R3)
&lt;span class="p"&gt;  2.&lt;/span&gt; Implement CI/CD pipeline (T1: 0 pts)
&lt;span class="p"&gt;  3.&lt;/span&gt; Disclose data provenance and IRB status (B1: 0 pts)
──────────────────────────────────────────────
⚠ This report is LLM-generated. It is not a regulatory
  determination. Report expires: 2026-09-16.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expiry date is not cosmetic. A report based on a repository that went inactive months ago should not circulate in procurement pipelines as if it still describes current reality. Version 1.0.4 computes that date from recent project activity automatically.&lt;/p&gt;
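&lt;p&gt;One way such an activity-derived expiry could work, with illustrative thresholds (the actual v1.0.4 rule is not published here):&lt;/p&gt;

```python
# Sketch of activity-derived report expiry. The thresholds are illustrative
# assumptions, not the actual v1.0.4 rule.

from datetime import date, timedelta

def report_expiry(audit_date, last_activity):
    """More recent project activity earns a longer report validity window."""
    days_idle = (audit_date - last_activity).days
    if days_idle <= 30:
        validity = 180   # actively maintained: report holds for ~6 months
    elif days_idle <= 180:
        validity = 90
    else:
        validity = 30    # dormant repo: the report goes stale quickly
    return audit_date + timedelta(days=validity)

report_expiry(date(2026, 3, 20), date(2026, 3, 10))  # → date(2026, 9, 16)
```

&lt;p&gt;Under these assumed thresholds, an actively maintained repository audited on 2026-03-20 would expire on 2026-09-16, consistent with the sample report above.&lt;/p&gt;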




&lt;h2&gt;
  
  
  What the Tiers Mean — and What They Do Not
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi5linfr9tmpcitsr5e6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi5linfr9tmpcitsr5e6.png" alt="The STEM-AI Tier Matrix" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T0 Rejected:        Trust not established
                    Clinical use prohibited

T1 Quarantine:      High risk
                    Independent verification required

T2 Caution:         Research reference only
                    Clinical automation forbidden

T3 Review:          Supervised pilot eligible
                    Human oversight mandatory

T4 (highest tier):  Strongest observed governance signals
                    Still requires independent expert review
                    and formal regulatory clearance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
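&lt;p&gt;The tier ladder implies a score-to-tier mapping. The numeric cutoffs below are assumptions for illustration; the article defines the tiers, not their boundaries:&lt;/p&gt;

```python
# Sketch of a score-to-tier mapping. The cutoffs are assumptions; the
# article defines the tiers but not their numeric boundaries.

def tier(final_score, hard_floor=False):
    if hard_floor:
        return "T0"  # H3+H4 override: no score can escape it
    for cutoff, t in ((85, "T4"), (70, "T3"), (50, "T2"), (30, "T1")):
        if final_score >= cutoff:
            return t
    return "T0"

tier(28)  # the sample report's 28/100 lands in T0 under these cutoffs
```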



&lt;p&gt;Even at the highest tier, STEM-AI does not substitute for clinical validation, expert review, or regulatory clearance. That is not a disclaimer added to soften the framework. It is how the framework was designed.&lt;/p&gt;




&lt;h2&gt;
  
  
  One Design Decision That Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmkd4kdz8euf1stkye0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmkd4kdz8euf1stkye0r.png" alt="Engineering integrity supersedes domain pedigree" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author background does not affect the score.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Whether the author comes from biology, medicine, ML research, or pure software engineering is recorded as contextual information for human reviewers. It is explicitly non-scoring and carries a mandatory bias warning in every report.&lt;/p&gt;

&lt;p&gt;Domain credentials are not the same thing as engineering integrity. And lack of domain pedigree does not imply carelessness. STEM-AI is built to evaluate observable repository governance — not to sort developers by prestige.&lt;/p&gt;




&lt;h2&gt;
  
  
  What STEM-AI Refuses To Be
&lt;/h2&gt;

&lt;p&gt;A trust framework for medical AI can become dangerous if it turns into a personal attack engine.&lt;/p&gt;

&lt;p&gt;STEM-AI is constrained on purpose. It is limited to public professional repositories and public professional material. It forbids PII inference, private-account speculation, and individual profiling as a use case. Without that boundary, a trust evaluator becomes a harassment tool.&lt;/p&gt;




&lt;h2&gt;
  
  
  Three Questions This Framework Is Built Around
&lt;/h2&gt;

&lt;p&gt;If you remember nothing else from this framework, remember these:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Did the repository describe its limits honestly?&lt;/li&gt;
&lt;li&gt;Did public communication remain consistent with the repository's stated limits?&lt;/li&gt;
&lt;li&gt;Did the codebase show evidence of maintenance and biological responsibility?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those are not performance questions. They are accountability questions. For tools that sit upstream of clinical decisions, accountability is not optional.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Comes Tomorrow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbelheefsc472rgc7rtl7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbelheefsc472rgc7rtl7.png" alt="what to do tomorrow" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tomorrow we publish the first audit set from STEM-AI v1.0.4 across 10 open-source bio-AI repositories — including projects from research institutions, an actively used SaaS platform, and agent-style bioinformatics tooling running inside containerized environments.&lt;/p&gt;

&lt;p&gt;What we can say now:&lt;/p&gt;

&lt;p&gt;The strongest-looking repositories are not always the most accountable. In at least one case, a repository with solid engineering signals still falls short because a critical disclosure is missing from the surface a reviewer would actually read first.&lt;/p&gt;

&lt;p&gt;In another, the safety control exists in the code but fails as governance because it does not activate under default deployment.&lt;/p&gt;

&lt;p&gt;In another, the generation mechanism itself is not the problem. The problem is that users were never clearly told what it was. That is a disclosure failure, not a technical failure. The distinction matters.&lt;/p&gt;

&lt;p&gt;The full breakdown publishes tomorrow.&lt;/p&gt;

&lt;p&gt;Which bio-AI repositories would you want audited next? Drop them in the comments.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;STEM-AI v1.0.4 — full audit results tomorrow.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Code works. But does the author care about the patient?"&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>healthtech</category>
    </item>
    <item>
      <title>The Schema Existed. The Model Had No Way to Know.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 19 Mar 2026 17:11:28 +0000</pubDate>
      <link>https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626</link>
      <guid>https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Glossary: terms used in this article&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkgcyvlbbbkdknffqdil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgkgcyvlbbbkdknffqdil.png" alt="The Engine of Memory Governance" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Loss&lt;/strong&gt;: The architectural characteristic of LLMs where no information persists between independent conversations. Not a bug. A design property with real engineering consequences for long-running projects.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Trust Class&lt;/strong&gt;: The reliability classification of a context item's source. In this series: &lt;code&gt;canonical&lt;/code&gt; (repo truth), &lt;code&gt;distilled&lt;/code&gt; (summarized from sessions), &lt;code&gt;raw&lt;/code&gt; (unprocessed session output), &lt;code&gt;symbolic&lt;/code&gt; (reference only).&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invoke Role&lt;/strong&gt;: A governance label that defines a context item's eviction behavior. In this series: &lt;code&gt;anchor&lt;/code&gt; (never evict), &lt;code&gt;bridge&lt;/code&gt; (preserve across phases), &lt;code&gt;hint&lt;/code&gt; (drop first under pressure), &lt;code&gt;none&lt;/code&gt; (drop immediately).&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation&lt;/strong&gt;: The mechanism by which a MICA archive reaches an AI session. Without explicit invocation, the archive exists but has no effect on the session. Formalized as a required field in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Admission Gate&lt;/strong&gt;: The point at which a context item is evaluated for inclusion. Decides what goes in — before output validation begins.&lt;/p&gt;
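&lt;p&gt;The Invoke Role definitions above already imply an eviction ordering. A minimal sketch of that ordering, using the glossary's role names (the sorting code itself is illustrative):&lt;/p&gt;

```python
# Sketch of eviction ordering implied by the invoke_role glossary entries:
# "none" drops immediately, "hint" drops first under pressure, "bridge"
# survives phase changes, "anchor" is never evicted. Code is illustrative.

EVICTION_ORDER = {"none": 0, "hint": 1, "bridge": 2, "anchor": 3}

def eviction_candidates(items):
    """Return evictable items, most expendable first; anchors are excluded."""
    evictable = [i for i in items if i["invoke_role"] != "anchor"]
    return sorted(evictable, key=lambda i: EVICTION_ORDER[i["invoke_role"]])
```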

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8g14yqylzyf2jo1vd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frl8g14yqylzyf2jo1vd0.png" alt="Cover Image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1. What v0.0.1 Was&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11ar2cv2m4nd7pyf8sbg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F11ar2cv2m4nd7pyf8sbg.png" alt="Session loss is a structural property" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/flamehaven01/my-llm-kept-forgetting-my-project-so-i-built-a-governance-schema-4bo6"&gt;Part 1&lt;/a&gt; established the problem: session loss is not an inconvenience. For long-running projects, it is a structural failure. The standard responses — longer prompts, RAG, session summaries, flat memory exports — all treat context as a document to be read, not a governed structure with authority levels, eviction rules, and provenance.&lt;/p&gt;

&lt;p&gt;v0.0.1 was the first attempt at a specification-level answer.&lt;/p&gt;

&lt;p&gt;It proved that context could be structured. It did not prove that the structure could govern what context is allowed to shape the session. Those are different claims. v0.0.1 satisfied the first and failed the second — in three distinct ways.&lt;/p&gt;

&lt;p&gt;Only one of those failures made the other two meaningless.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2. A Different Layer&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j6pm407yuyjvk7sm6w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9j6pm407yuyjvk7sm6w.png" alt="Governing inputs over parsing outputs" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While working through the problem, I was reading broadly. arXiv papers on memory management. Dev.to articles on LLM reliability. Community discussions about what people had tried and where it broke.&lt;/p&gt;

&lt;p&gt;One article in particular was useful: &lt;a href="https://dev.to/dev-in-progress/why-asking-an-llm-for-json-isnt-enough-1n8a"&gt;Why Asking an LLM for JSON Isn't Enough&lt;/a&gt;. In that article and the discussion it generated, a clear framing had emerged:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Treat the LLM as an unreliable upstream service. Add schema, validation, retry, fallback.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The operative mental model was this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;LLM → output → validate → retry → fallback → use
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This framing is correct and useful for what it solves: defensive parsing and output reliability. But none of it addresses admission control: refusing a context item before it ever reaches the model.&lt;/p&gt;

&lt;p&gt;The question I needed to answer was structurally different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context item → admitted? → trust level assigned → provenance checked → eviction priority set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first asks "how do I handle what comes out?" The second asks "what goes in, and under what rules?"&lt;/p&gt;

&lt;p&gt;MICA is not a replacement for output validation. It is the upstream governance layer that decides what is allowed to shape the session before output validation even begins.&lt;/p&gt;
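&lt;p&gt;A minimal sketch of such an admission gate, using the trust classes from the glossary; the gate logic is an illustration, not the MICA reference implementation:&lt;/p&gt;

```python
# Sketch of an admission gate using the glossary's trust classes. The gate
# logic is an illustration, not the MICA reference implementation.

TRUST_CLASSES = {"canonical", "distilled", "raw", "symbolic"}

def admit(item):
    """Decide whether a context item may shape the session at all."""
    if item.get("trust_class") not in TRUST_CLASSES:
        return False   # unknown provenance is never admitted
    if item["trust_class"] == "raw" and not item.get("provenance"):
        return False   # unprocessed session output needs a traceable source
    return True
```

&lt;p&gt;The gate runs before any prompt is assembled, which is the structural difference: rejection here means the model never sees the item, rather than the output being repaired after the fact.&lt;/p&gt;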




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Three Failures in v0.0.1&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqnf2xj0wmrc8qo9wpu7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftqnf2xj0wmrc8qo9wpu7.png" alt="MICA failures" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Failure 1: No defined semantics for scoring.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrh3pk9s6x4wps2icsye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flrh3pk9s6x4wps2icsye.png" alt="Failure 1" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;scoring_policy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;relevance_inputs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;similarity"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recency"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;invoke_role"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;trust_class"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;capsule&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;continuity&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;weight"&lt;/span&gt;
  &lt;span class="na"&gt;scoring_hints&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;canonical_memory_bonus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.15&lt;/span&gt;
    &lt;span class="na"&gt;continuity_bridge_bonus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.05&lt;/span&gt;
    &lt;span class="na"&gt;raw_logs_penalty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.20&lt;/span&gt;
    &lt;span class="na"&gt;symbolic_only_penalty&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These were hardcoded test values. They existed to confirm that the pipeline could apply differential weights at all — not to define what those weights should be.&lt;/p&gt;

&lt;p&gt;The problem was not that the numbers were heuristic. The problem was that the heuristics had no defined semantics: no stated rule explained how they combined into a score.&lt;/p&gt;

&lt;p&gt;There was no combination rule. &lt;br&gt;
No output range. &lt;br&gt;
No normalization. &lt;/p&gt;

&lt;p&gt;What &lt;code&gt;canonical_memory_bonus: 0.15&lt;/code&gt; meant relative to &lt;code&gt;raw_logs_penalty: 0.20&lt;/code&gt;, how those four numbers combined into a final score, what the result represented — none of that was specified. &lt;/p&gt;

&lt;p&gt;A conforming implementation was not possible. That means two tools claiming to implement v0.0.1 could rank the same archive differently and both still claim compliance.&lt;/p&gt;
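&lt;p&gt;For contrast, here is what one possible defined semantics could look like: additive adjustments on a base relevance score, clamped to a stated range. This is an illustration of what the spec was missing, not the rule later versions adopted:&lt;/p&gt;

```python
# One possible defined semantics for the v0.0.1 hints: additive adjustments
# on a base relevance score, clamped to a stated [0, 1] range. Illustrative;
# this is what the spec lacked, not what it said.

HINTS = {
    "canonical_memory_bonus": +0.15,
    "continuity_bridge_bonus": +0.05,
    "raw_logs_penalty": -0.20,
    "symbolic_only_penalty": -0.10,
}

def combined_score(base_relevance, applicable):
    """Combination rule: base plus all applicable hints, clamped to [0, 1]."""
    return max(0.0, min(1.0, base_relevance + sum(HINTS[h] for h in applicable)))
```

&lt;p&gt;The specific rule matters less than the fact that it is stated: with a declared combination rule, output range, and clamp, two independent implementations must rank the same archive the same way.&lt;/p&gt;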


&lt;h3&gt;
  
  
  &lt;strong&gt;Failure 2: Invariants encoded as comments, not constraints.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysolxvmc8or8awcvps8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fysolxvmc8or8awcvps8t.png" alt="Failure 2" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v0.0.1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;design_invariants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Broker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SovDef."&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;LawBinder&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;never&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;receive&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;raw&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Context&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;objects."&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Only&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;working_context&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;may&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;be&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;LLM-facing&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;layers."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A plain list of strings. No &lt;code&gt;id&lt;/code&gt;. No &lt;code&gt;severity&lt;/code&gt;. No &lt;code&gt;track&lt;/code&gt;. They were written as constraints, but encoded as comments. The difference matters: a constraint has enforcement. A note does not. These strings could not be machine-evaluated, could not be diff-checked against session behavior, and could not be reliably extracted by a model reading the archive.&lt;/p&gt;

&lt;p&gt;What a machine-actionable invariant requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;design_invariants&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;INV-001&lt;/span&gt;
    &lt;span class="na"&gt;rule&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Context&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Broker&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;must&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;generate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;SovDef."&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;critical"&lt;/span&gt;
    &lt;span class="na"&gt;enforceable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;track&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;compile-time"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without &lt;code&gt;id&lt;/code&gt;, an invariant cannot be referenced in audit output. Without &lt;code&gt;severity&lt;/code&gt;, a violation cannot be triaged. Without &lt;code&gt;track&lt;/code&gt;, enforcement has no defined point of application. v0.0.1 had none of these.&lt;/p&gt;
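&lt;p&gt;With those fields present, conformance can be checked mechanically. A sketch of such a validator, with an assumed severity vocabulary:&lt;/p&gt;

```python
# Sketch of a conformance check for machine-actionable invariants. Field
# names follow the example above; the severity vocabulary is an assumption.

REQUIRED_FIELDS = ("id", "rule", "severity", "track")
SEVERITIES = {"critical", "major", "minor"}

def validate_invariant(inv):
    """Return a list of problems; an empty list means the entry is auditable."""
    problems = ["missing field: " + f for f in REQUIRED_FIELDS if f not in inv]
    if "severity" in inv and inv["severity"] not in SEVERITIES:
        problems.append("unknown severity: " + inv["severity"])
    return problems
```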




&lt;h3&gt;
  
  
  &lt;strong&gt;Failure 3: No path to the model.&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzld30stv4kv380by6uso.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzld30stv4kv380by6uso.png" alt="Failure 3" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There was no &lt;code&gt;invocation_protocol&lt;/code&gt;. No session-start procedure. No confirmation that the archive had been loaded. The model would begin a session with no instruction to locate the archive, no way to confirm it had been read, and no defined behavior for what to do at session start.&lt;/p&gt;

&lt;p&gt;The archive existed. The model had no reliable way to know it existed.&lt;/p&gt;

&lt;p&gt;A new session could start, answer confidently, and never once acknowledge the archive it was supposed to be governed by.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;4. Why Failure 3 Made the Others Meaningless&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hho8hzc5b1y5h7j9kda.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2hho8hzc5b1y5h7j9kda.png" alt="The hierarchy of Failure" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The three failures are not equal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Failure 1 is a calibration problem. The formula can be fixed. The weights can be justified. That work happened in subsequent versions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Failure 2 is a schema design problem. Fields can be added. Structure can be enforced. That work happened in the same range.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Failure 3 is different in kind. It is not a field that needs to be added. It is the requirement that the schema define how it reaches the model at all. A context item classified as &lt;code&gt;anchor&lt;/code&gt; with &lt;code&gt;eviction_priority: 3&lt;/code&gt; has no enforcement if the model never sees the archive.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failures 1 and 2 describe a schema that could not govern correctly. Failure 3 describes a schema that could not govern at all.&lt;/p&gt;

&lt;p&gt;A context system does not fail only when it forgets. It also fails when it remembers without governance — and v0.0.1 could not even guarantee the second.&lt;/p&gt;

&lt;p&gt;Every version from v0.1.0 through v0.1.7 addressed the first two failures. Scoring became implementable. Invariants gained structure. Eviction became a five-phase strategy. Error handling was defined.&lt;/p&gt;

&lt;p&gt;The invocation problem was not formally addressed until v0.1.8, which introduced &lt;code&gt;invocation_protocol&lt;/code&gt; as a required field. It declares how the archive reaches an AI session, what pattern is used, and what the session opening report must contain.&lt;/p&gt;
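&lt;p&gt;What that requirement buys is a failure mode that is loud instead of silent. A sketch of a session-start check built on such a field; the exact schema keys may differ from v0.1.8:&lt;/p&gt;

```python
# Sketch of the session-start check that a required invocation_protocol
# enables: the session fails loudly if the archive never reached it.
# Key names are modeled on the article; the v0.1.8 schema may differ.

def open_session(archive):
    proto = archive.get("invocation_protocol")
    if proto is None:
        # The v0.0.1 situation: the archive exists, the model cannot know.
        raise RuntimeError("archive defines no invocation_protocol; session is ungoverned")
    items = archive.get("items", [])
    return "MICA archive loaded via %s; %d context items under governance" % (
        proto["pattern"], len(items))
```

&lt;p&gt;The opening report doubles as confirmation: if the session does not begin with it, the archive was never invoked, and everything classified inside it is inert.&lt;/p&gt;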

&lt;p&gt;That is the distance between v0.0.1 and v0.1.8.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. What This Series Covers Next&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Part 1&lt;/strong&gt; defined the problem.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 2&lt;/strong&gt; (this part) documented v0.0.1 — what it was, what it got wrong, and why the most fundamental failure took the longest to fix.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Part 3&lt;/strong&gt; covers v0.1.0 through v0.1.5: how scoring moved from hardcoded guesses to an implementable formula, what the eviction strategy revealed about context budget assumptions, and what was still missing at v0.1.5.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m7bef0rv0lb8n4q1o28.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m7bef0rv0lb8n4q1o28.png" alt="last thought" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>contextengineering</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>My LLM Kept Forgetting My Project. So I Built a Governance Schema.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 16 Mar 2026 09:19:41 +0000</pubDate>
      <link>https://dev.to/flamehaven01/my-llm-kept-forgetting-my-project-so-i-built-a-governance-schema-4bo6</link>
      <guid>https://dev.to/flamehaven01/my-llm-kept-forgetting-my-project-so-i-built-a-governance-schema-4bo6</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Disclosure: This article was written by the author with AI assistance for editing. All technical content, architecture decisions, and design rationale are the author's own. #ABotWroteThis&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Glossary: terms used in this article&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Loss&lt;/strong&gt;: The architectural characteristic of LLMs where no information persists between independent conversations. Not a bug. A design property with real engineering consequences for long-running projects.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invoke Role&lt;/strong&gt;: A governance label that defines a context item's eviction behavior. In this series: &lt;code&gt;anchor&lt;/code&gt; (never evict), &lt;code&gt;bridge&lt;/code&gt; (preserve across phases), &lt;code&gt;hint&lt;/code&gt; (drop first under pressure), &lt;code&gt;none&lt;/code&gt; (drop immediately).&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Trust Class&lt;/strong&gt;: The reliability classification of a context item's source. In this series: &lt;code&gt;canonical&lt;/code&gt; (repo truth), &lt;code&gt;distilled&lt;/code&gt; (summarized from sessions), &lt;code&gt;raw&lt;/code&gt; (unprocessed session output), &lt;code&gt;symbolic&lt;/code&gt; (reference only).&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Anchor Item&lt;/strong&gt;: A context item that cannot be dropped under any memory pressure. Eviction priority: 0. Later parts explain how this is enforced at the schema level.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Semantic Collapse&lt;/strong&gt;: A pattern introduced in later parts of this series. A JSON Schema is applied to an LLM as a runtime contract rather than as a validator.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Fail-Closed Gate&lt;/strong&gt;: An admission rule that excludes a context item if it fails any defined threshold. No exceptions. Formalized in v0.1.7.&lt;/p&gt;
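&lt;p&gt;To make the invoke-role ladder concrete, here is a minimal Python sketch of eviction under memory pressure. The four role names come from the glossary above; the item structure, token counts, and budget logic are illustrative assumptions, not the MICA schema itself:&lt;/p&gt;

```python
# Higher number = survives longer under pressure; anchors never drop.
EVICTION_ORDER = {"none": 0, "hint": 1, "bridge": 2, "anchor": 3}

def evict_under_pressure(items, budget):
    """Keep the most protected items within a token budget; never drop anchors."""
    by_protection = sorted(
        items, key=lambda it: EVICTION_ORDER[it["invoke_role"]], reverse=True
    )
    survivors, used = [], 0
    for item in by_protection:
        # Anchors bypass the budget check entirely (never evicted).
        if item["invoke_role"] == "anchor" or budget >= used + item["tokens"]:
            survivors.append(item)
            used += item["tokens"]
    return survivors

items = [
    {"id": "naming constraint", "invoke_role": "anchor", "tokens": 50},
    {"id": "phase handoff note", "invoke_role": "bridge", "tokens": 40},
    {"id": "style preference", "invoke_role": "hint", "tokens": 30},
    {"id": "scratch output", "invoke_role": "none", "tokens": 20},
]
# With a 100-token budget, the hint and the unlabeled item are dropped first.
print([it["id"] for it in evict_under_pressure(items, 100)])
```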




&lt;h2&gt;
  
  
  &lt;strong&gt;1. The Problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqze5u8qphd7f6pl5vaw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqze5u8qphd7f6pl5vaw.png" alt="The Structural Flaw" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LLMs do not retain state between sessions. Every conversation starts from zero.&lt;/p&gt;

&lt;p&gt;For a single task, that is manageable.&lt;/p&gt;

&lt;p&gt;For a project maintained across dozens of sessions, it is not.&lt;/p&gt;

&lt;p&gt;Such a project accumulates architectural decisions, non-negotiable constraints, protected files, and a decision history explaining why things are the way they are. In that setting, the lack of continuity becomes a structural failure with compounding consequences.&lt;/p&gt;

&lt;p&gt;The failure mode is insidious. The model does not fail visibly. It produces well-formed, internally consistent output. What it cannot know is which of its inferences are wrong, because the context it received was incomplete. It cannot identify what is missing. Instead, it fills the gaps silently, drawing on training data rather than on the project's actual history.&lt;/p&gt;

&lt;p&gt;Standard responses address parts of the problem. Longer system prompts, RAG pipelines, session summaries. None addresses the governance layer: which context items are authoritative, which are provisional, how they should be weighted against one another, and what happens when memory pressure forces eviction.&lt;/p&gt;

&lt;p&gt;This post documents the specification I built to address that layer.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Where This Started
&lt;/h2&gt;

&lt;p&gt;For a long time, my workaround was simple and inefficient. Copy a conversation from one AI, paste the whole thing into another, and ask the second one to analyze it. The idea was to get a fresh perspective. To avoid the first model's blind spots by bringing the work to a different context.&lt;/p&gt;

&lt;p&gt;It rarely worked cleanly. The second model would inherit the framing, the vocabulary, even the confident wrong assumptions of the first session. It would pick up the conclusion without the reasoning errors that led to it. Then build on both. This pattern is well-documented in multi-model workflows. The receiving model anchors to the original session's assumptions rather than evaluating the content independently. Researchers studying AI review and self-critique systems have noted the same anchoring effect. Same-session review produces systematically worse error detection than fresh-session review.&lt;/p&gt;

&lt;p&gt;So I kept looking for a better way to carry context across sessions.&lt;/p&gt;

&lt;p&gt;At some point, I came across Claude's memory feature. Not while looking for it specifically. While exploring settings. Claude had introduced persistent memory for paid plans in October 2025, with the stated goal that users shouldn't have to re-explain their context at the start of every session. Anthropic's own description: &lt;em&gt;"your first conversation feels like your hundredth."&lt;/em&gt; The export feature produces structured output. Categories, dates, entries, one per line.&lt;/p&gt;

&lt;p&gt;Conceptually, this is similar to Claude's Skills system: a modular, reusable layer that carries working state between sessions without rebuilding it from scratch each time. The intent is sound.&lt;/p&gt;

&lt;p&gt;I used it. It helped. But something was consistently off.&lt;/p&gt;

&lt;p&gt;The output looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[2025-02-11] - user prefers concise responses
[2025-02-11] - medical AI governance pipeline must never skip the evidence gate
[2025-02-11] - currently exploring whether to add a search endpoint
[2025-02-11] - low-N prohibition is non-negotiable for clinical outputs
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four entries. Same format. Same weight. But they are not the same kind of thing. The second and fourth are hard constraints. In a medical AI context, violating them is not a quality issue. It is a safety issue. The first is a style preference. The third is an open question that may be abandoned next week. The export format has no mechanism to express that difference.&lt;/p&gt;
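&lt;p&gt;For illustration, here is how those same four entries might look once role and trust metadata are attached. The field names follow this series' vocabulary; the specific assignments are mine, not tool output:&lt;/p&gt;

```python
# The same four flat export entries, annotated with governance metadata.
# invoke_role, trust_class, and eviction_priority are the series' terms;
# the value choices here are illustrative.
entries = [
    {"text": "user prefers concise responses",
     "invoke_role": "hint", "trust_class": "distilled", "eviction_priority": 3},
    {"text": "medical AI governance pipeline must never skip the evidence gate",
     "invoke_role": "anchor", "trust_class": "canonical", "eviction_priority": 0},
    {"text": "currently exploring whether to add a search endpoint",
     "invoke_role": "none", "trust_class": "raw", "eviction_priority": 3},
    {"text": "low-N prohibition is non-negotiable for clinical outputs",
     "invoke_role": "anchor", "trust_class": "canonical", "eviction_priority": 0},
]

# Under memory pressure, only the two safety constraints are untouchable.
anchors = [e["text"] for e in entries if e["invoke_role"] == "anchor"]
print(anchors)
```

&lt;p&gt;The flat export carries the same text, but nothing in it lets an implementation compute that second list.&lt;/p&gt;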

&lt;p&gt;Others in the community have hit the same wall in different ways. From r/ClaudeAI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A custom MCP memory server still caused the model to skip stored context ~40% of the time&lt;/li&gt;
&lt;li&gt;&lt;em&gt;"Notes are lossy. They capture what I thought was important, not what Claude actually found important."&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is consistent. The flat format puts the entire prioritization burden on the user.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Governance Gap
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F253u1ktrb7earwptg0bh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F253u1ktrb7earwptg0bh.png" alt="Storage is not Governance" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anthropic labels the memory import feature experimental, with the explicit caveat that Claude may not always successfully incorporate imported memories. The export is a snapshot. What it cannot express is governance: which items are anchors, which are provisional, what survives memory pressure, and what the source of each item actually was.&lt;/p&gt;

&lt;p&gt;At the same time, I had been building &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. A static analysis tool for detecting low-quality AI-generated code. Its core premise: not all signals are equal. Some patterns are weighted more heavily. Some trigger hard blocks regardless of aggregate score. The scoring model is explicit, versioned, and auditable.&lt;/p&gt;

&lt;p&gt;The same structure was missing from context management. A flat export has no weights, no eviction rules, no provenance, no trust hierarchy. It is a list. A governance schema is something different.&lt;/p&gt;

&lt;p&gt;That gap is where MICA (Memory Invocation &amp;amp; Context Archive) started.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. The Structural Problem with LLM Context&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0icuty445mwmdnm797i0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0icuty445mwmdnm797i0.png" alt="The Danger of Silent Regression" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Language models do not retain information between sessions. This is a known architectural property, not an implementation gap. Every new conversation begins from zero.&lt;/p&gt;

&lt;p&gt;For short, self-contained tasks, this is manageable. For long-running engineering projects, it is not. Codebases under active development. Governance systems with accumulated decision history. Architectures with non-negotiable invariants. Each creates a specific and compounding failure mode.&lt;/p&gt;

&lt;p&gt;The failure is not that the model performs poorly. It is that the model performs well. Confidently. Using whatever context it was given. That context is almost always incomplete in ways the model cannot detect.&lt;/p&gt;

&lt;p&gt;Consider what a project accumulates over months of development:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type of knowledge&lt;/th&gt;
&lt;th&gt;Survives session loss?&lt;/th&gt;
&lt;th&gt;Consequence of loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Current file contents&lt;/td&gt;
&lt;td&gt;Partially (if re-provided)&lt;/td&gt;
&lt;td&gt;Recoverable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Architecture decisions and rationale&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Silent regression risk&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Constraints that are non-negotiable&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Violated without awareness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solutions already tried and abandoned&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Repeated work, repeated failures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Trust levels of different information sources&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;All context treated equally&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model cannot distinguish between a constraint that is load-bearing and one that is provisional. It does not know which decisions have downstream dependencies. It cannot identify that a suggested refactoring was already evaluated and rejected for a reason not present in the current context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result&lt;/strong&gt;: the model fills gaps with plausible inference. Inference drawn from training data, not from project history.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Why the Standard Fixes Don't Fully Work
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xt8o8akdr934806w5w7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3xt8o8akdr934806w5w7.png" alt="The Illusio of Current Fixes" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every developer running a long project with an LLM eventually hits the wall and builds something to address it. The community has been running these experiments for a while now.&lt;/p&gt;

&lt;p&gt;The diagram above shows the pattern. Each approach handles one or two dimensions. None handles governance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Long system prompts&lt;/strong&gt; are the first instinct. Write everything down, start every session with it. It works until maintaining the prompt becomes a second job. One developer described it precisely: "the scaffolding becomes the work." There is also a structural issue no token count fixes: the "lost in the middle" effect. Models attend well to the beginning and end of long contexts and degrade in between. An architecture constraint buried in paragraph 11 of a 3,000-token prompt may not receive adequate attention regardless of whether it fits in the window.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;RAG pipelines&lt;/strong&gt; handle knowledge injection well. They do not handle governance. Retrieving a paragraph about how a caching layer works is different from the model understanding that this specific caching layer cannot be modified without a deviation log entry. RAG provides facts. It does not provide the weight of facts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Session summaries&lt;/strong&gt; are a reasonable manual workaround and the one I used most. The problem: summarization is lossy by design.&lt;br&gt;
Summaries strip the rationale behind decisions. The model gets the conclusion without the reasoning that produced it. When a new edge case appears that challenges an earlier decision, the model has no basis to evaluate whether the constraint still holds. It just follows the summary.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flat memory exports&lt;/strong&gt; — including Claude's own export format — do carry context across sessions. That part works. What they cannot express is priority. Which items are non-negotiable. Which are provisional. Which are stale. The burden of sorting that out falls entirely on the user, every time.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The common limitation across all four: they treat context as a document to be read. None treats it as a governed structure where different items have different trust levels, different eviction protection, and different behavior when they conflict.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Looking for an Answer
&lt;/h2&gt;

&lt;p&gt;The question was whether anyone had already solved this at the specification level.&lt;/p&gt;

&lt;p&gt;The research on LLM memory management is substantial. &lt;a href="https://arxiv.org/abs/2310.08560" rel="noopener noreferrer"&gt;MemGPT (Packer et al., 2023)&lt;/a&gt; introduced the OS analogy: main context as RAM, external storage as disk. Later work extended this into hierarchical architectures and tiered storage models. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/abs/2509.18868" rel="noopener noreferrer"&gt;A 2025 survey (arXiv:2509.18868)&lt;/a&gt; maps the landscape and documents a problem the field had started to name: &lt;strong&gt;self-degradation&lt;/strong&gt;. Naive "add everything" strategies cause memory inflation. Inflation leads to error propagation. The agent performs worse over time. Not better.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The direction is clear: memory needs to be tiered, decayed, and managed. Not accumulated.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What I could not find was a specification for the governance layer. Not for a multi-agent research system. For a single developer maintaining a real project across sessions. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to express that some items are non-negotiable anchors and others are provisional notes.&lt;/li&gt;
&lt;li&gt;How to define eviction behavior a conforming implementation must follow.&lt;/li&gt;
&lt;li&gt;How to require provenance.&lt;/li&gt;
&lt;li&gt;How to version and audit context state the way you version and audit code.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The existing systems were engines. What was missing was a schema.&lt;/p&gt;

&lt;p&gt;That is what v0.0.1 attempted. The first version was rough. It got several things wrong. Part 2 covers what those failures revealed.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. What This Series Covers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4of6y5vm8oi2pxdneafy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4of6y5vm8oi2pxdneafy.png" alt="Intoducing MICA" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is the problem. MICA is the attempt at a specification-level answer to it.&lt;/p&gt;

&lt;p&gt;This is Part 1. It documents the problem and the motivation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 2: v0.0.1. The first schema version, what it defined, what it got wrong, and why those failures revealed exactly the problem this series is trying to solve.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The series runs as long as there is something concrete to document.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named failure mode from this post:&lt;/strong&gt; the governance gap. Context systems that store what you said, but not what it meant, where it came from, or how much it mattered.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>contextengineering</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
