<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kwansub Yun</title>
    <description>The latest articles on DEV Community by Kwansub Yun (@flamehaven01).</description>
    <link>https://dev.to/flamehaven01</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3508506%2Fe2f9bc29-10d2-41ec-8e77-19b8b5cfd9e9.jpg</url>
      <title>DEV Community: Kwansub Yun</title>
      <link>https://dev.to/flamehaven01</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flamehaven01"/>
    <language>en</language>
    <item>
      <title>From Score to Workflow: Turning STEM BIO-AI Into a Local Audit System</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Fri, 08 May 2026 08:25:50 +0000</pubDate>
      <link>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</link>
      <guid>https://dev.to/flamehaven01/from-score-to-workflow-turning-stem-bio-ai-into-a-local-audit-system-5amp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Earlier in this series, I wrote about why bio/medical AI repositories need more than benchmarks, what I learned after auditing 10 public repositories, and why an AI auditor itself needs a memory contract.&lt;/p&gt;

&lt;p&gt;That work led to STEM-AI v1.1.2 and the MICA layer: a memory-contracted initialization step that forces the auditor to load bounded rules before scoring begins. If you have not read that part, the relevant post is here:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the broader arc, the full series is here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;STEM-AI / STEM BIO-AI series&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But after that, a different engineering problem took over.&lt;/p&gt;

&lt;p&gt;The audit logic was stricter.&lt;br&gt;&lt;br&gt;
The reports were richer.&lt;br&gt;&lt;br&gt;
The reasoning was more bounded.&lt;/p&gt;

&lt;p&gt;But the developer workflow still felt too loose.&lt;/p&gt;

&lt;p&gt;So the next question was no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How do I score trust?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How does a bio-AI audit tool become something an engineer can actually run, gate, inspect, and integrate?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The answer turned out to be less about seeing more signals and more about refusing to confuse them.&lt;/p&gt;

&lt;p&gt;That is the core argument of this post:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Once I took that seriously, STEM BIO-AI stopped looking like “one score plus some extra metadata” and started looking like a system with distinct lanes, distinct boundaries, and distinct operator workflows.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The problem was no longer scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgevfs1ir5axbqcnehca4.png" alt="The problem was no longer scoring" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By the time I reached the 1.6.x line, the rubric was no longer the main bottleneck.&lt;/p&gt;

&lt;p&gt;The bottleneck was operational clarity.&lt;/p&gt;

&lt;p&gt;A trust audit tool is not very useful if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the normal path is one long command with too many flags&lt;/li&gt;
&lt;li&gt;CI has to reverse-engineer the result from human-readable stdout&lt;/li&gt;
&lt;li&gt;bio-specific diagnostics are mixed directly into the same surface as formal scoring&lt;/li&gt;
&lt;li&gt;regulatory relevance shows up as vague implication instead of explicit traceability&lt;/li&gt;
&lt;li&gt;advisory AI is present, but its relationship to the official score is unclear&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, the tool stops being hard to trust for conceptual reasons and starts being hard to trust for operational reasons.&lt;/p&gt;

&lt;p&gt;That is a different class of problem.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;The CLI had to reflect operator intent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier CLI was functional, but too flat.&lt;/p&gt;

&lt;p&gt;You could do things like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T3 &lt;span class="nt"&gt;--format&lt;/span&gt; json &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem /path/to/repo &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
stem /path/to/repo &lt;span class="nt"&gt;--advisory-response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All of that worked.&lt;/p&gt;

&lt;p&gt;The issue was that it treated very different operator intents as one long option surface.&lt;/p&gt;

&lt;p&gt;In practice, these are separate workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scan a repository and generate artifacts&lt;/li&gt;
&lt;li&gt;enforce a gate in CI/CD&lt;/li&gt;
&lt;li&gt;export a bounded advisory packet&lt;/li&gt;
&lt;li&gt;validate a downstream provider response&lt;/li&gt;
&lt;li&gt;cross an explicit provider-call boundary&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I refactored the CLI around workflows instead of flag accumulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan &amp;lt;folder&amp;gt;
stem gate &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2
stem advisory validate &amp;lt;folder&amp;gt;
stem advisory packet &amp;lt;folder&amp;gt;
stem advisory call &amp;lt;folder&amp;gt;
stem advisory check-response &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--response&lt;/span&gt; FILE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The older paths still exist for compatibility:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem &amp;lt;folder&amp;gt;
stem audit &amp;lt;folder&amp;gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--tier-gate&lt;/span&gt; T2 &lt;span class="nt"&gt;--quiet&lt;/span&gt;
stem &amp;lt;folder&amp;gt; &lt;span class="nt"&gt;--advisory&lt;/span&gt; packet
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But they are no longer the conceptual center.&lt;/p&gt;

&lt;p&gt;That matters more than it sounds.&lt;/p&gt;

&lt;p&gt;Once the command names match the operator’s intent, the system becomes easier to teach, easier to remember, and easier to wire into pipelines.&lt;/p&gt;

&lt;p&gt;This is not just a DX cleanup. In a medical or bio-adjacent audit context, command ambiguity is part of the trust problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Repository trust needed four separate lanes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8cnkkkskgtmge5xk6hwe.png" alt="Repository trust needed four separate lanes" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest architectural shift.&lt;/p&gt;

&lt;p&gt;I stopped treating repository trust as one object.&lt;/p&gt;

&lt;p&gt;In practice, it needed four separate lanes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;deterministic structural scoring&lt;/li&gt;
&lt;li&gt;deterministic diagnostics&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;optional AI advisory&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If all of those collapse into one final confidence score, the tool becomes harder to reason about.&lt;/p&gt;

&lt;p&gt;The more regulated the domain, the more dangerous it becomes to collapse every useful signal into one score.&lt;/p&gt;

&lt;p&gt;Some evidence should change the score.&lt;br&gt;
Some evidence should only raise review priority.&lt;br&gt;
Some evidence should support traceability.&lt;br&gt;
Some evidence should be handed to a human or advisory system.&lt;/p&gt;

&lt;p&gt;The maturity of the tool is not that it sees all of them.&lt;/p&gt;

&lt;p&gt;The maturity is that it does not confuse them.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;This separation is not just conceptual. It exists in the code path.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;One reasonable objection to any architecture write-up is: are these really separate lanes, or are they just different labels on the same output object?&lt;/p&gt;

&lt;p&gt;In STEM BIO-AI, the answer is visible in the execution order.&lt;/p&gt;

&lt;p&gt;The scanner computes the formal score first. In the result object, that means keys like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage 1&lt;/li&gt;
&lt;li&gt;Stage 2R&lt;/li&gt;
&lt;li&gt;Stage 3&lt;/li&gt;
&lt;li&gt;risk penalty&lt;/li&gt;
&lt;li&gt;score cap&lt;/li&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does it append the non-scoring layers, again as explicit result keys:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;reasoning_model&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;optional &lt;code&gt;ai_advisory&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That ordering matters.&lt;/p&gt;

&lt;p&gt;The score is not derived from the advisory lane.&lt;br&gt;
The regulatory mapping does not mutate the formal tier.&lt;br&gt;
The diagnostics lane can emit evidence without becoming a hidden score multiplier.&lt;/p&gt;

&lt;p&gt;This is also why the JSON shape ended up more layered than earlier versions. The output had to preserve the distinction the code was already enforcing.&lt;/p&gt;
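&lt;p&gt;A minimal Python sketch of that ordering, with stand-in helpers. The key names mirror the result keys in this post; the helper logic is an illustrative assumption, not the STEM BIO-AI source:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the lane ordering described above.
# Key names follow this post; the helpers are illustrative stand-ins.

def compute_formal_score(repo):
    # Stand-in for Stage 1 / Stage 2R / Stage 3 scoring plus
    # risk penalty and score cap.
    return repo.get("structural_score", 0)

def tier_for(score):
    # Deliberately simplified cut; the full published scale
    # appears later in this post.
    return "T2" if score in range(55, 70) else "T0"

def build_result(repo, advisory=None):
    result = {}
    # Lane 1: the formal score is computed first, then sealed.
    result["final_score"] = compute_formal_score(repo)
    result["formal_tier"] = tier_for(result["final_score"])
    # Lanes 2-4 are appended afterwards as separate keys; none of
    # them reads or mutates final_score / formal_tier.
    result["regulatory_traceability"] = {"role": "traceability aid"}
    if advisory is not None:
        result["ai_advisory"] = advisory  # optional, advisory-only
    return result
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The point of the sketch is only the ordering: the advisory argument can add a key, but nothing downstream of the first two assignments can rewrite them.&lt;/p&gt;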

&lt;p&gt;That execution order is the architectural reason the next four sections exist.&lt;/p&gt;

&lt;p&gt;Once I had the lanes separated in code, each lane needed its own claim boundary, its own output semantics, and its own reason for not being collapsed into the others.&lt;/p&gt;

&lt;p&gt;Put differently, the next four sections answer four different questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is allowed to change the formal tier&lt;/li&gt;
&lt;li&gt;what is useful enough to emit, but not yet mature enough to score&lt;/li&gt;
&lt;li&gt;what can support regulatory review without pretending to be compliance&lt;/li&gt;
&lt;li&gt;what can involve AI without letting AI become the scoring authority&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;1. Deterministic structural scoring&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjtrypiyfrasqki89isbj.png" alt="The official baseline for triage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This remains the official score and tier.&lt;/p&gt;

&lt;p&gt;It measures the main repository-visible signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README evidence&lt;/li&gt;
&lt;li&gt;repo-local consistency&lt;/li&gt;
&lt;li&gt;code and bio responsibility&lt;/li&gt;
&lt;li&gt;dependency hygiene&lt;/li&gt;
&lt;li&gt;changelog and provenance surfaces&lt;/li&gt;
&lt;li&gt;code-integrity patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This lane is local, deterministic, and machine-checkable.&lt;/p&gt;

&lt;p&gt;That is the part that can legitimately drive a formal triage tier.&lt;/p&gt;

&lt;p&gt;I am not claiming this is the only possible architecture. A different system could have folded diagnostics or replication more aggressively into one unified score.&lt;/p&gt;

&lt;p&gt;I chose not to, because the narrower score proved easier to defend. A smaller claim with cleaner boundaries was more valuable here than a broader score with ambiguous semantics.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;2. Deterministic diagnostics&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This is where the deterministic diagnostics spec became important.&lt;/p&gt;

&lt;p&gt;I needed a place for findings that are real, useful, and inspectable, but should not silently perturb the main score until they are calibrated.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;docs/DETERMINISTIC_DIAGNOSTICS.md&lt;/code&gt; defines.&lt;/p&gt;

&lt;p&gt;It separates the diagnostic problem into two lanes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lane A: deterministic local diagnostics&lt;/li&gt;
&lt;li&gt;Lane B: optional AI-assisted semantic review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That separation is central.&lt;/p&gt;

&lt;p&gt;The deterministic lane is authoritative for hard findings.&lt;br&gt;
The AI lane is advisory only.&lt;/p&gt;

&lt;p&gt;The local diagnostic lane currently focuses on evidence-bearing bio-specific signals such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed or suspicious SMILES-like outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;li&gt;silent mock or simulated-data fallbacks&lt;/li&gt;
&lt;li&gt;risky subprocess construction around bio tools&lt;/li&gt;
&lt;li&gt;traceability manifest surfaces&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point was not to create a “bio slop detector” with a catchy label.&lt;/p&gt;

&lt;p&gt;The point was to create a local evidence lane that could say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;here is the file&lt;/li&gt;
&lt;li&gt;here is the line&lt;/li&gt;
&lt;li&gt;here is the snippet&lt;/li&gt;
&lt;li&gt;here is the bounded interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is much more useful than a vague semantic warning.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why diagnostics stayed evidence-only
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4uyzvdequdx3f2juhj58.png" alt="Retaining evidence without inflating the score" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was one of the harder engineering decisions.&lt;/p&gt;

&lt;p&gt;It would have been easy to push every new bio-specific detector directly into the final score.&lt;/p&gt;

&lt;p&gt;I did not do that.&lt;/p&gt;

&lt;p&gt;The deterministic diagnostics spec is explicit that many of these findings begin as evidence-only. In practice, that means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;findings are emitted as line-level records in the result object’s &lt;code&gt;evidence_ledger&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings appear in Markdown and &lt;code&gt;--explain&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;findings do not change &lt;code&gt;final_score&lt;/code&gt; or &lt;code&gt;formal_tier&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the right default.&lt;/p&gt;
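&lt;p&gt;For illustration, here is a hypothetical shape for one such evidence-only record. Every field name is an assumption, following the file / line / snippet / bounded-interpretation pattern above rather than the actual schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical evidence-only record; field names are illustrative,
# not the actual STEM BIO-AI schema.
finding = {
    "finding_id": "BIO-SMILES-001",   # illustrative ID
    "file": "models/generator.py",
    "line": 142,
    "snippet": "return 'CCCC'  # placeholder",
    "interpretation": "repeated trivial SMILES output; evidence-only",
    "affects_score": False,           # never mutates final_score
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

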

&lt;p&gt;For example, the SMILES lane can be very useful for detecting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;malformed surface strings&lt;/li&gt;
&lt;li&gt;low-entropy placeholders&lt;/li&gt;
&lt;li&gt;repeated trivial outputs&lt;/li&gt;
&lt;li&gt;missing parser guards&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But it does not prove:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;medicinal usefulness&lt;/li&gt;
&lt;li&gt;synthetic feasibility&lt;/li&gt;
&lt;li&gt;binding plausibility&lt;/li&gt;
&lt;li&gt;biological efficacy&lt;/li&gt;
&lt;li&gt;full chemical validity in every edge case&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That boundary is important.&lt;/p&gt;

&lt;p&gt;A detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;
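&lt;p&gt;To make that boundary concrete, here is the flavor of check such a lane can legitimately make. This is a deliberately simplified sketch, not the shipped detector; the thresholds and names are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Simplified, hypothetical surface checks in the spirit of the
# SMILES lane: they flag malformed-looking strings without making
# any claim about chemistry.

def smiles_surface_flags(s):
    flags = []
    # Unbalanced branch characters are a cheap malformed signal.
    if s.count("(") != s.count(")"):
        flags.append("unbalanced_branches")
    # Very low character diversity on a non-trivial string suggests
    # a placeholder, not a molecule.
    if len(set(s)) in (1, 2) and len(s) not in (0, 1, 2, 3):
        flags.append("low_entropy_placeholder")
    return flags
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both checks stay on the allowed side of the boundary: they can say a string looks malformed or placeholder-like, and nothing more.&lt;/p&gt;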

&lt;p&gt;Just as importantly, this is not meant to be a permanent holding area for every detector. The diagnostics spec is explicit that score impact should only happen after commit-pinned benchmark evidence, explicit false-positive review, and reproducible calibration. In other words, evidence-only is the temporary safe default until a detector has earned score authority.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;3. Regulatory traceability&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftesjgssomvk8xj4t0sye.png" alt="Traceability is not a permission slip" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second document that became central was &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This solved a different problem.&lt;/p&gt;

&lt;p&gt;Once you audit clinical-adjacent repositories, people naturally ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;does this align with EU AI Act themes?&lt;/li&gt;
&lt;li&gt;does this help with FDA-oriented review?&lt;/li&gt;
&lt;li&gt;is there anything relevant to IMDRF or SaMD evidence families?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The wrong answer would be to turn those questions into a fake compliance score.&lt;/p&gt;

&lt;p&gt;So I did the opposite.&lt;/p&gt;

&lt;p&gt;The regulatory layer is explicitly framed as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;a traceability aid, not a compliance verdict&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That document maps observed evidence classes to requirement families with bounded confidence labels like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;strong&lt;/li&gt;
&lt;li&gt;moderate&lt;/li&gt;
&lt;li&gt;weak-moderate&lt;/li&gt;
&lt;li&gt;weak&lt;/li&gt;
&lt;li&gt;not assessed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And it makes an important distinction:&lt;/p&gt;

&lt;p&gt;the confidence applies to the mapping relationship, not to legal acceptability.&lt;/p&gt;

&lt;p&gt;Those confidence labels are not model outputs and they are not inferred at runtime. They are fixed, rule-level mapping judgments attached to evidence classes in the mapping document itself. For example, changelog / checksum / config-manifest style evidence is treated as a moderate traceability signal for Article 12-style review, while human-oversight interface signals stay weak because interface presence is not the same thing as oversight procedure.&lt;/p&gt;
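&lt;p&gt;A hypothetical excerpt of what such fixed mapping judgments could look like in code. The evidence-class names here are assumptions; in the real tool the judgments live in &lt;code&gt;docs/REGULATORY_MAPPING.md&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative excerpt only; evidence-class names are assumptions.
REG_MAPPING = {
    "changelog_checksum_manifest": {
        "family": "record-keeping / traceability (Article 12-style)",
        "confidence": "moderate",
    },
    "human_oversight_interface": {
        "family": "human-oversight interface review",
        # Interface presence is not the same as oversight procedure.
        "confidence": "weak",
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

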

&lt;p&gt;That means the tool can say things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;versioned manifests and changelogs may support record-keeping / traceability review&lt;/li&gt;
&lt;li&gt;intended-use and disclaimer sections may support transparency scaffolding review&lt;/li&gt;
&lt;li&gt;override interfaces may support human-oversight interface review&lt;/li&gt;
&lt;li&gt;subgroup measurement language may support weak evidence of data-governance intent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;without claiming:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;legal compliance&lt;/li&gt;
&lt;li&gt;regulatory clearance&lt;/li&gt;
&lt;li&gt;clinical certification&lt;/li&gt;
&lt;li&gt;deployer conformance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a regulated domain, traceability is useful only when it does not pretend to be permission.&lt;/p&gt;
&lt;h3&gt;
  
  
  A concrete example: why Article 12 is traceability, not compliance
&lt;/h3&gt;

&lt;p&gt;The best example here is EU AI Act Article 12-style traceability.&lt;/p&gt;

&lt;p&gt;The regulatory mapping layer treats signals like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;changelogs&lt;/li&gt;
&lt;li&gt;checksum manifests&lt;/li&gt;
&lt;li&gt;versioned config surfaces&lt;/li&gt;
&lt;li&gt;audit-log schema fragments&lt;/li&gt;
&lt;li&gt;decision-event or override-event schema tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;as evidence that a repository may have traceability scaffolding.&lt;/p&gt;

&lt;p&gt;That is useful.&lt;/p&gt;

&lt;p&gt;It is also bounded.&lt;/p&gt;

&lt;p&gt;The mapping document is explicit that changelog presence is not the same thing as deploy-time event logging, and that current scope does not establish runtime log completeness.&lt;/p&gt;

&lt;p&gt;So the output can legitimately say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;there is structural evidence relevant to traceability review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;while refusing to say:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;this system satisfies traceability obligations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is exactly the kind of distinction I wanted this lane to enforce.&lt;/p&gt;

&lt;p&gt;What this buys in practice is not a compliance shortcut, but a faster review question. If a repository exposes none of the scaffolding signals in this lane — no change history, no artifact hashes, no versioned manifests, no event-schema surfaces — then there is very little reason to treat it as traceability-ready for deeper institutional review. If those signals do exist, the next step is still expert inspection, but the scanner has at least opened the right folder and pointed at the right files.&lt;/p&gt;
&lt;h3&gt;
  
  
  Why regulatory mapping stayed subordinate to evidence
&lt;/h3&gt;

&lt;p&gt;This was non-negotiable.&lt;/p&gt;

&lt;p&gt;Regulatory relevance had to remain downstream from evidence, not a score multiplier pretending to be law.&lt;/p&gt;

&lt;p&gt;That is why the output shape separates things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;regulatory_basis&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stage_traceability&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;regulatory_traceability&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;from the actual score computation.&lt;/p&gt;

&lt;p&gt;And it is not just decorative structure.&lt;/p&gt;

&lt;p&gt;The regulatory basis object is registry-driven. It can mark &lt;code&gt;review_required&lt;/code&gt; when the basis registry is stale or required source families are missing. That is a traceability control on the mapping layer itself, not an input into the scoring formula.&lt;/p&gt;
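&lt;p&gt;A hedged sketch of that control. The function and field names, and the one-year threshold, are assumptions; the point is that staleness flips &lt;code&gt;review_required&lt;/code&gt; without touching any score key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import datetime

# Hypothetical sketch of a registry-driven review_required control;
# names and the 365-day threshold are illustrative assumptions.
def regulatory_basis(registry, today=None):
    today = today or datetime.date.today()
    basis = {"sources": registry["sources"], "review_required": False}
    age = today - registry["last_reviewed"]
    # A stale registry (over a year) or missing source families both
    # force review, without feeding into the scoring formula.
    if age.days not in range(366) or not registry["sources"]:
        basis["review_required"] = True
    return basis
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

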

&lt;p&gt;This is also why the regulatory note belongs in a muted traceability panel, not next to the main score.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is useful.&lt;/p&gt;

&lt;p&gt;If a repo has traceability-relevant scaffolding, that is still not compliance.&lt;/p&gt;

&lt;p&gt;The distinction has to remain visible in both the code and the artifacts.&lt;/p&gt;


&lt;h2&gt;
  
  
  &lt;strong&gt;4. Optional AI advisory&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zlifs3rxmxuwm87n8z0.png" alt="Enforcing a bounded intelligence sandbox" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fourth lane is the advisory layer.&lt;/p&gt;

&lt;p&gt;This one exists for bounded model-assisted review, but it does not get to rewrite the official outcome.&lt;/p&gt;

&lt;p&gt;That means workflows like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo
stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;can exist without creating ambiguity about who owns the formal result.&lt;/p&gt;

&lt;p&gt;The advisory layer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;export a provider-neutral packet&lt;/li&gt;
&lt;li&gt;validate downstream response structure&lt;/li&gt;
&lt;li&gt;enforce finding-ID citation rules&lt;/li&gt;
&lt;li&gt;reject prohibited claims&lt;/li&gt;
&lt;li&gt;surface runtime and secret boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it cannot do is silently override:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;score.final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;score.formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How that rule is actually enforced
&lt;/h3&gt;

&lt;p&gt;This is not just policy language in the README.&lt;/p&gt;

&lt;p&gt;The advisory validator explicitly checks for score-override attempts. If a response includes fields like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;final_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;formal_tier&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_score&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;replication_tier&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;or sets &lt;code&gt;final_score_override&lt;/code&gt;, the response is marked invalid with &lt;code&gt;final_score_override_requested&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The packet contract also exports the rule in plain language:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Do not modify or override &lt;code&gt;final_score&lt;/code&gt;, &lt;code&gt;formal_tier&lt;/code&gt;, &lt;code&gt;replication_score&lt;/code&gt;, or &lt;code&gt;replication_tier&lt;/code&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And provider responses must cite exact values from &lt;code&gt;allowed_finding_ids&lt;/code&gt;; citation strings are not repaired or loosely matched later.&lt;/p&gt;
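&lt;p&gt;Put together, the two checks could be sketched like this. The key and flag names follow the post; the validator body itself is an illustrative assumption, not the shipped code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Hypothetical sketch of the two advisory boundaries described
# above; only the key and flag names come from the post.

FORBIDDEN_KEYS = ("final_score", "formal_tier",
                  "replication_score", "replication_tier")

def validate_advisory_response(response, allowed_finding_ids):
    errors = []
    # 1. Any attempt to set scoring keys invalidates the response.
    if (any(k in response for k in FORBIDDEN_KEYS)
            or response.get("final_score_override")):
        errors.append("final_score_override_requested")
    # 2. Citations must be exact values from allowed_finding_ids;
    #    nothing is repaired or fuzzily matched.
    for cited in response.get("cited_findings", []):
        if cited not in allowed_finding_ids:
            errors.append("unknown_finding_id:" + cited)
    return errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

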

&lt;p&gt;So the advisory lane is bounded in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;it has no authority to change the deterministic result&lt;/li&gt;
&lt;li&gt;it cannot cite evidence outside the bounded packet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the kind of mechanism I mean when I say “better boundaries.” If the rule cannot be checked, it is not really part of the architecture yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What operational use looks like now&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnqaat0z96ca7h4e0obx8.png" alt="One execution driving distinct operator surfaces" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once these lanes were separated, the CLI became much easier to reason about.&lt;/p&gt;

&lt;p&gt;Local engineering review:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem scan /path/to/repo &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CI/CD gate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem gate /path/to/repo &lt;span class="nt"&gt;--min-tier&lt;/span&gt; T2 &lt;span class="nt"&gt;--summary&lt;/span&gt; off &lt;span class="nt"&gt;--output&lt;/span&gt; results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Offline advisory packet generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory packet /path/to/repo &lt;span class="nt"&gt;--output&lt;/span&gt; advisory_out
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Downstream provider response validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;stem advisory check-response /path/to/repo &lt;span class="nt"&gt;--response&lt;/span&gt; provider_advisory.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The important point is not just that these commands exist.&lt;/p&gt;

&lt;p&gt;It is that each one represents a distinct trust boundary.&lt;/p&gt;

&lt;p&gt;That made the project feel more like engineering infrastructure and less like a scoring demo.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;A real v1.6.2 packet&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To make that less abstract, I re-ran STEM BIO-AI v1.6.2 against a local clone of &lt;a href="https://github.com/ClawBio/ClawBio" rel="noopener noreferrer"&gt;ClawBio&lt;/a&gt;, which describes itself as a local-first, privacy-focused, reproducible bioinformatics-native AI skill library.&lt;/p&gt;

&lt;p&gt;The command was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; stem_ai.cli scan /path/to/ClawBio &lt;span class="nt"&gt;--level&lt;/span&gt; 3 &lt;span class="nt"&gt;--format&lt;/span&gt; all &lt;span class="nt"&gt;--explain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fitfu4woqxooan1vs88zb.png" alt="ClawBio_ClawBio_detailed_5p-1" width="800" height="1131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb6lqg9kj5y8v5n0vcpia.png" alt="ClawBio_ClawBio_detailed_5p-2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On my machine, that run took about &lt;strong&gt;9.4 seconds&lt;/strong&gt; and emitted the usual CLI output set: a machine-readable JSON result, a Markdown report, a 5-page PDF packet, and a line-level explain trace.&lt;/p&gt;

&lt;p&gt;Before the numbers, the important context is that STEM BIO-AI uses a published triage scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;T0&lt;/code&gt; = 0-39&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T1&lt;/code&gt; = 40-54&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T2&lt;/code&gt; = 55-69&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T3&lt;/code&gt; = 70-84&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T4&lt;/code&gt; = 85-100&lt;/li&gt;
&lt;/ul&gt;
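&lt;p&gt;As a minimal sketch (not the shipped implementation), that scale is just a cutoff table; &lt;code&gt;tier_for&lt;/code&gt; and the cutoff list here are hypothetical names:&lt;/p&gt;

```python
import bisect

# Published triage scale from above: T0 = 0-39, T1 = 40-54,
# T2 = 55-69, T3 = 70-84, T4 = 85-100.
# CUTOFFS holds the lower bound of T1..T4.
TIERS = ["T0", "T1", "T2", "T3", "T4"]
CUTOFFS = [40, 55, 70, 85]

def tier_for(score):
    # bisect_right counts how many cutoffs the score has passed,
    # which is exactly the tier index.
    return TIERS[bisect.bisect_right(CUTOFFS, score)]

print(tier_for(67))  # prints T2 -- where the ClawBio run below lands
```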

&lt;p&gt;Stage 4 replication is reported separately as its own lane, where &lt;code&gt;R2&lt;/code&gt; means some reproducibility scaffolding is present, but not yet enough to call the repository replication-strong.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Governance note:&lt;br&gt;
This is not a “bad repository” scoreboard, a clinical safety verdict, or a moral ranking. It is a deterministic evidence-surface pre-screen intended to support review, not replace it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With that in mind, the result was:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;67 / 100&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;T2 Caution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Replication lane: 55 / 100 (R2)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clinical adjacency: CA-DIRECT&lt;/strong&gt; (the repository surface makes direct healthcare-facing claims, even though it also carries an explicit non-clinical boundary)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Code integrity warnings: C2 dependency pinning, C4 exception handling&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the workflow shift I wanted the tool to support.&lt;/p&gt;

&lt;p&gt;The same deterministic scan is rendered into multiple operator surfaces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON for automation&lt;/li&gt;
&lt;li&gt;Markdown for review&lt;/li&gt;
&lt;li&gt;PDF for human-facing packet inspection&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--explain&lt;/code&gt; for file / line / snippet proof tracing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That output shape is only possible because the result object already separates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;formal score and tier&lt;/li&gt;
&lt;li&gt;replication lane&lt;/li&gt;
&lt;li&gt;diagnostics lane&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;advisory boundary state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the PDF is not a separate product. It is a view over the same bounded audit object.&lt;/p&gt;
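&lt;p&gt;A rough sketch of that "one object, many views" shape, with illustrative field names rather than the exact v1.6.2 schema:&lt;/p&gt;

```python
import json

# Hypothetical shape of the bounded audit object; the field names
# are illustrative, not the exact v1.6.2 schema.
result = {
    "final_score": 67,
    "formal_tier": "T2 Caution",
    "replication_lane": {"score": 55, "grade": "R2"},
    "clinical_adjacency": "CA-DIRECT",
}

def to_json(result):
    # Machine surface: the object itself, serialized.
    return json.dumps(result, indent=2)

def to_markdown(result):
    # Human surface: a view over the same fields, not a second product.
    lane = result["replication_lane"]
    return "\n".join([
        "## Audit result",
        f"- Score: {result['final_score']} / 100 ({result['formal_tier']})",
        f"- Replication lane: {lane['score']} / 100 ({lane['grade']})",
        f"- Clinical adjacency: {result['clinical_adjacency']}",
    ])
```

&lt;p&gt;The PDF and explain trace would be further renderers over the same dict, which is why the formats cannot drift apart.&lt;/p&gt;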

&lt;p&gt;Two details from this run are worth calling out.&lt;/p&gt;

&lt;p&gt;First, the scanner did &lt;strong&gt;not&lt;/strong&gt; manufacture chemistry findings just because ClawBio is bio-adjacent. The deterministic diagnostics lane reported:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;SMILES Surface Integrity: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES RDKit Validation: not_applicable&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SMILES Parser Guard: not_detected&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the behavior I want. If a detector has no evidence, it should stay silent instead of inflating the report with domain-flavored noise. This is what the earlier thesis looks like when it hits real output: a detector becomes more trustworthy when it is strict about what it cannot conclude.&lt;/p&gt;
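&lt;p&gt;In sketch form, an evidence-first detector looks like this; &lt;code&gt;detect_smiles_surface&lt;/code&gt;, the regex, and the status strings are hypothetical, not the shipped detector:&lt;/p&gt;

```python
import re

# Illustrative "stay silent" contract: with no evidence, the detector
# reports not_detected instead of emitting domain-flavored noise.
SMILES_HINT = re.compile(r"\b(rdkit|Chem\.MolFromSmiles|SMILES)\b")

def detect_smiles_surface(files):
    # files: mapping of path to file text
    hits = [path for path, text in files.items() if SMILES_HINT.search(text)]
    if not hits:
        # No evidence: say so explicitly, with an empty evidence list.
        return {"status": "not_detected", "evidence": []}
    return {"status": "detected", "evidence": hits}

print(detect_smiles_surface({"main.py": "print('hello')"})["status"])  # not_detected
```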

&lt;p&gt;Second, the score is strict about observable repository conventions. ClawBio uses &lt;code&gt;ClawBio_README_Repo.md&lt;/code&gt; rather than a root &lt;code&gt;README.md&lt;/code&gt;, so the scan records &lt;code&gt;S1_missing_readme: -20&lt;/code&gt;. A human reviewer might decide that this is acceptable contextually. The scanner does not make that leap for them. It only records what the repository exposes through the surfaces it knows how to measure.&lt;/p&gt;

&lt;p&gt;That distinction matters. A &lt;code&gt;T2 Caution&lt;/code&gt; result here does not mean “ClawBio is unsafe.” It means the current repository surface still raises review-relevant signals under the published deterministic rules, including dependency-pinning warnings, exception-handling warnings in a clinical-adjacent surface, and a README convention check that is stricter than most human reviewers would be.&lt;/p&gt;

&lt;p&gt;And that is exactly why the next section matters: once the workflow is concrete, the remaining question is not whether the tool can produce an answer, but where its current boundaries still need to stay visible.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What still has to stay bounded&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The system is better than it was, but there are still obvious next steps.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The public surface is broad
&lt;/h3&gt;

&lt;p&gt;There is now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scoring&lt;/li&gt;
&lt;li&gt;diagnostics&lt;/li&gt;
&lt;li&gt;replication&lt;/li&gt;
&lt;li&gt;advisory packeting&lt;/li&gt;
&lt;li&gt;regulatory traceability&lt;/li&gt;
&lt;li&gt;JSON / Markdown / PDF / explain outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is useful, but it increases onboarding cost.&lt;/p&gt;

&lt;p&gt;The CLI itself is clearer now, but the broader public surface has to stay equally disciplined.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. The deterministic diagnostics lane is still missing a published calibration threshold
&lt;/h3&gt;

&lt;p&gt;The diagnostics lane is evidence-first by design, but one practical gap remains: the public release does not yet ship a benchmark-backed threshold document saying exactly when a detector is mature enough to graduate from evidence-only into score-bearing territory.&lt;/p&gt;

&lt;p&gt;Right now the rule is conceptually clear:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;commit-pinned fixtures&lt;/li&gt;
&lt;li&gt;reproducible detector output&lt;/li&gt;
&lt;li&gt;explicit false-positive review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But the public decision boundary is still partly narrative. Until that calibration surface is published in a more operational form, keeping diagnostics evidence-only is the safer choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The regulatory confidence labels are rule-authored, not empirically validated
&lt;/h3&gt;

&lt;p&gt;The mapping labels like &lt;code&gt;strong&lt;/code&gt;, &lt;code&gt;moderate&lt;/code&gt;, and &lt;code&gt;weak-moderate&lt;/code&gt; are currently fixed rule-level judgments in the mapping document. They are not runtime model outputs, but they are also not yet backed by inter-rater reliability studies or a published reviewer-agreement benchmark.&lt;/p&gt;

&lt;p&gt;That means they are useful as bounded structural mapping language, but they should not be treated as empirical proof that multiple auditors would converge on exactly the same label distribution.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Earlier context&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;Medical AI Repositories Need More Than Benchmarks. We Built STEM-AI to Audit Trust&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2"&gt;How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Try it yourself&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;STEM BIO-AI is Apache 2.0 and fully open source.&lt;/p&gt;

&lt;p&gt;If you want to know whether a bio/medical AI repository is actually exposing reviewable evidence, or whether your own repository is weaker than you think, run it yourself.&lt;/p&gt;

&lt;p&gt;That is the real test.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/flamehaven01/STEM-BIO-AI" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/STEM-BIO-AI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;License: Apache 2.0&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Final thought&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The earlier STEM-AI posts were about why repository trust deserves its own audit layer.&lt;/p&gt;

&lt;p&gt;This phase was about something more practical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;what does that audit layer have to look like if an engineer is actually going to run it, inspect it, and put it in a pipeline?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For me, the answer was simple:&lt;/p&gt;

&lt;p&gt;Separate the workflows.&lt;br&gt;
Separate the lanes.&lt;br&gt;
Keep diagnostics evidence-first.&lt;br&gt;
Keep regulatory mapping subordinate to evidence.&lt;br&gt;
Keep advisory AI bounded.&lt;/p&gt;

&lt;p&gt;Optimize for inspectability, not just score production.&lt;/p&gt;

&lt;p&gt;That is what changed the project.&lt;/p&gt;

&lt;p&gt;Not bigger claims.&lt;/p&gt;

&lt;p&gt;Better boundaries.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6pay2ivnar9ryd22kjk.png" alt="Final thought" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>opensource</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Do You Trust the AI Auditor? STEM-AI v1.1.2 and Memory-Contracted Bio-AI Audits</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 13:51:45 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</link>
      <guid>https://dev.to/flamehaven01/how-do-you-trust-the-ai-auditor-stem-ai-v112-and-memory-contracted-bio-ai-audits-1gc2</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1fj5kfblnvz80fjttvnl.png" alt=" " width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Previous article:&lt;br&gt;
&lt;a href="https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5"&gt;&lt;strong&gt;How Auditing 10 Bio-AI Repositories Shaped STEM-AI&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the first STEM-AI write-up, I described what happened after auditing 10 open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The important lesson was not just that some repositories lacked clinical disclaimers, tests, or governance artifacts.&lt;/p&gt;

&lt;p&gt;The more useful lesson was this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Text-only review is too weak for bio/medical AI. You have to inspect the code path.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That worked.&lt;/p&gt;

&lt;p&gt;But it exposed the next problem.&lt;/p&gt;

&lt;p&gt;If an AI system is auditing another AI or bioinformatics repository, how do you trust the auditor?&lt;/p&gt;

&lt;p&gt;LLMs drift. &lt;br&gt;
One session can enforce a clinical boundary strictly. &lt;br&gt;
Another can invent a generous middle score for the same boundary case. In normal software review, that is annoying. In medical AI governance, it is a liability.&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 is my answer to that problem.&lt;/p&gt;

&lt;p&gt;It does not try to make the LLM deterministic by writing a longer prompt.&lt;/p&gt;

&lt;p&gt;It binds the audit to a memory contract.&lt;/p&gt;


&lt;h2&gt;
  
  
  What v1.1.2 adds
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frcsghndeq1guwwkoy2y7.png" alt="standard audit vs Bio/Medical AI audit" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;STEM-AI v1.1.2 introduces &lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA: Memory-Injected Contract Architecture&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;before the auditor reads the target repository, it must load a fixed audit contract and self-check the rules it is not allowed to bend.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The v1.1.2 layer includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;memory/mica.yaml&lt;/code&gt; -- composition contract&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai.mica.v1.1.2.json&lt;/code&gt; -- machine-checkable memory archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-playbook.v1.1.2.md&lt;/code&gt; -- session playbook and drift guard&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;memory/stem-ai-lessons.v1.1.2.md&lt;/code&gt; -- historical failure-mode archive&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;spec/STEM-AI_v1.1.2_CORE.md&lt;/code&gt; -- canonical audit spec&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract pins 18 invariants.&lt;/p&gt;

&lt;p&gt;Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stage order is fixed: README intent, cross-platform evidence, code/bio evidence.&lt;/li&gt;
&lt;li&gt;Stage weights are fixed.&lt;/li&gt;
&lt;li&gt;Tier boundaries are fixed.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; cannot be bypassed.&lt;/li&gt;
&lt;li&gt;Stage 2 may use external evidence or Stage 2R repo-local consistency in LOCAL_ANALYSIS mode.&lt;/li&gt;
&lt;li&gt;Governance overlay cannot raise the formal base tier.&lt;/li&gt;
&lt;li&gt;C1-C4 code-integrity checks only run in LOCAL_ANALYSIS mode.&lt;/li&gt;
&lt;li&gt;Mandatory clinical-use disclaimers cannot be omitted.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a claim that the LLM becomes perfectly deterministic.&lt;/p&gt;

&lt;p&gt;It is a narrower claim:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The auditor is forced to operate inside a contract whose scoring rules, hard floors, and evidence requirements are inspectable.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the useful layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  What "loading the contract" means
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flo9k78n4gct6xfq15ls2.png" alt="Forcing the auditor to operate inside a machine-checkable memory contract" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/series/37087"&gt;MICA&lt;/a&gt;&lt;/strong&gt; is not hidden model memory.&lt;/p&gt;

&lt;p&gt;It is also not a claim that the model provider changed the LLM.&lt;/p&gt;

&lt;p&gt;In v1.1.2, "loading the contract" means the audit session starts by reading a fixed set of repository files before it is allowed to score the target:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;memory/mica.yaml
memory/stem-ai.mica.v1.1.2.json
memory/stem-ai-playbook.v1.1.2.md
memory/stem-ai-lessons.v1.1.2.md
spec/STEM-AI_v1.1.2_CORE.md
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcucfrq4bst2cxc3olb1d.png" alt="Pinning the audit rules mathematically" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The auditor then performs a pre-execution contract test:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;confirm the canonical spec exists&lt;/li&gt;
&lt;li&gt;confirm the memory archive exists&lt;/li&gt;
&lt;li&gt;confirm the invariant count is 18&lt;/li&gt;
&lt;li&gt;confirm the fixed tier boundaries are present&lt;/li&gt;
&lt;li&gt;confirm the Stage 2 / Stage 2R lane rule is present&lt;/li&gt;
&lt;li&gt;confirm Stage 3G cannot raise the formal tier&lt;/li&gt;
&lt;li&gt;confirm C1-C4 mode gating is active&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Only after that does the audit proceed.&lt;/p&gt;

&lt;p&gt;This does not make the LLM mathematically deterministic.&lt;/p&gt;

&lt;p&gt;It makes the audit procedure file-backed, inspectable, and interruptible. If the session cannot load or reconcile the contract files, the correct behavior is to stop before scoring.&lt;/p&gt;
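&lt;p&gt;A minimal sketch of that pre-flight shape, using the contract paths listed above; &lt;code&gt;preflight&lt;/code&gt; is a hypothetical name and the check is simplified to two of the listed conditions:&lt;/p&gt;

```python
from pathlib import Path

# The contract files v1.1.2 loads before scoring (paths from this post).
CONTRACT_FILES = [
    "memory/mica.yaml",
    "memory/stem-ai.mica.v1.1.2.json",
    "memory/stem-ai-playbook.v1.1.2.md",
    "memory/stem-ai-lessons.v1.1.2.md",
    "spec/STEM-AI_v1.1.2_CORE.md",
]

def preflight(repo_root, invariant_count):
    # Hypothetical pre-execution contract test: every contract file must
    # exist and the pinned invariant count must match, or the session
    # stops here -- before any scoring happens.
    missing = [p for p in CONTRACT_FILES if not (Path(repo_root) / p).is_file()]
    if missing or invariant_count != 18:
        raise SystemExit(f"contract not loaded, refusing to score: {missing}")
    return True
```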

&lt;p&gt;That is the difference between &lt;strong&gt;"please be consistent"&lt;/strong&gt; and &lt;strong&gt;"execute this versioned contract."&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The audit workflow
&lt;/h2&gt;

&lt;p&gt;STEM-AI v1.1.2 runs as a structured audit workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frihb9k729ll3vbqpvu3c.png" alt="STEM-AI v1.1.2 workflow" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In LOCAL_ANALYSIS mode, the auditor is not limited to what the README says.&lt;/p&gt;

&lt;p&gt;It can inspect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;package metadata&lt;/li&gt;
&lt;li&gt;workflow files&lt;/li&gt;
&lt;li&gt;test definitions&lt;/li&gt;
&lt;li&gt;dependency manifests&lt;/li&gt;
&lt;li&gt;source-code paths&lt;/li&gt;
&lt;li&gt;deprecated or dead-code paths&lt;/li&gt;
&lt;li&gt;exception handling&lt;/li&gt;
&lt;li&gt;credential patterns&lt;/li&gt;
&lt;li&gt;provenance and hash-checking logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output is intentionally split into two files:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;report.md                  # human-readable audit judgment
experiment_results.json    # machine-readable evidence and score object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kvqqnty1q4d011wcyo6.png" alt="Separating subjective reasoning from verifiable mathematics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That split matters.&lt;/p&gt;

&lt;p&gt;The report explains the reasoning.&lt;/p&gt;

&lt;p&gt;The JSON lets another reviewer inspect the score, evidence fields, flags, and integrity checks without trusting the prose.&lt;/p&gt;




&lt;h2&gt;
  
  
  A real target audit, not a synthetic example
&lt;/h2&gt;

&lt;p&gt;For this v1.1.2 demonstration, I used a real public repository:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/artic-network/fieldbioinformatics" rel="noopener noreferrer"&gt;artic-network/fieldbioinformatics&lt;br&gt;
&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The target is not the protagonist of this post.&lt;/p&gt;

&lt;p&gt;It is only the specimen used to show the audit workflow against a real bioinformatics codebase.&lt;/p&gt;

&lt;p&gt;The local audit produced:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;audits/fieldbioinformatics_v1_1_2/report.md
audits/fieldbioinformatics_v1_1_2/experiment_results.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The target snapshot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"remote"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/artic-network/fieldbioinformatics"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"branch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"master"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"commit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"8008b4c97c2193a82308ff6f0be507b1d9306e36"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"file_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;114&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the important part: the audit did not ask, "Does this README sound trustworthy?"&lt;/p&gt;

&lt;p&gt;It asked:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do README claims match actual package metadata and entry points?&lt;/li&gt;
&lt;li&gt;Are there real CI and domain-specific tests?&lt;/li&gt;
&lt;li&gt;Are dependencies reproducible enough?&lt;/li&gt;
&lt;li&gt;Are there credential leaks?&lt;/li&gt;
&lt;li&gt;Are there deprecated patient-adjacent paths?&lt;/li&gt;
&lt;li&gt;Do clinical-adjacent output paths fail closed?&lt;/li&gt;
&lt;li&gt;Does the repository include governance evidence, or only governance absence?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where STEM-AI is useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  The score object
&lt;/h2&gt;

&lt;p&gt;The machine-readable result records the score like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_1_readme_intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_cross_platform"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_repo_local_consistency"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_lane"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"STAGE_2R_REPO_LOCAL_CONSISTENCY"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_2_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"External Stage 2 was not collected; LOCAL_ANALYSIS used Stage 2R in the fixed 0.20 Stage 2 slot."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stage_3_code_bio"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"weights"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stage_3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"risk_penalty"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"final_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;63&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"formal_tier"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"T2 Caution"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;External Stage 2 is explicitly represented as &lt;code&gt;null&lt;/code&gt; for this local-only audit.&lt;/p&gt;

&lt;p&gt;That does not mean cross-platform consistency is unimportant.&lt;/p&gt;

&lt;p&gt;It means this evidence slice was deliberately scoped to LOCAL_ANALYSIS. Instead of pretending to have social/web evidence, v1.1.2 uses Stage 2R: Repo-Local Consistency.&lt;/p&gt;

&lt;p&gt;Stage 2R asks whether the repository's own surfaces agree with each other:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;README vs package metadata and CLI entry points&lt;/li&gt;
&lt;li&gt;README vs docs, tutorials, and troubleshooting&lt;/li&gt;
&lt;li&gt;README test claims vs CI workflow and test definitions&lt;/li&gt;
&lt;li&gt;clinical-adjacent outputs vs local intended-use boundaries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The contract defines the fixed-weight calculation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - Risk Penalty
      = (65 x 0.40) + (75 x 0.20) + (55 x 0.40) - 0
      = 63
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
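&lt;p&gt;Because the score object and the weights both live in the JSON, a second reviewer can recheck that arithmetic mechanically, without trusting the prose report. A sketch, using the fields shown above (not the shipped verifier):&lt;/p&gt;

```python
import json

# The score object from experiment_results.json, trimmed to the
# fields the recheck needs.
raw = """{
  "stage_1_readme_intent": 65,
  "stage_2_repo_local_consistency": 75,
  "stage_3_code_bio": 55,
  "weights": {"stage_1": 0.4, "stage_2": 0.2, "stage_3": 0.4},
  "risk_penalty": 0,
  "final_score": 63
}"""

obj = json.loads(raw)
w = obj["weights"]

# Final = (Stage 1 x 0.40) + (Stage 2R x 0.20) + (Stage 3 x 0.40) - penalty
recomputed = round(
    obj["stage_1_readme_intent"] * w["stage_1"]
    + obj["stage_2_repo_local_consistency"] * w["stage_2"]
    + obj["stage_3_code_bio"] * w["stage_3"]
    - obj["risk_penalty"]
)
assert recomputed == obj["final_score"]  # 63 either way
```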



&lt;p&gt;The final tier is therefore:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T2 Caution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not because the prose sounded balanced.&lt;/p&gt;

&lt;p&gt;Because the contract math forces that result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the T0 hard floor did not trigger
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdntp9d5l6yysb9sbndw.png" alt="Why the T0 hard floor did not trigger" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;T0_HARD_FLOOR&lt;/code&gt; is the rule that prevents a clinically dangerous repository from escaping rejection through good wording.&lt;/p&gt;

&lt;p&gt;In simplified form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If a repository is CA-DIRECT
and it has no substantive code implementation,
then final tier = T0 regardless of score math.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of CA-DIRECT include patient-specific diagnosis, treatment recommendation, triage, risk scoring, or clinical decision support.&lt;/p&gt;
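&lt;p&gt;The override logic, in sketch form; &lt;code&gt;apply_hard_floor&lt;/code&gt; is a hypothetical name for the simplified rule above:&lt;/p&gt;

```python
# Simplified sketch of T0_HARD_FLOOR: the floor overrides the score
# math, it never bends to it.
def apply_hard_floor(tier, ca_severity, has_substantive_code):
    if ca_severity == "CA-DIRECT" and not has_substantive_code:
        return "T0"  # rejected regardless of how the weighted score landed
    return tier

# The fieldbioinformatics audit: CA-INDIRECT with real implementation,
# so the computed tier stands.
print(apply_hard_floor("T2 Caution", "CA-INDIRECT", True))  # T2 Caution
```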

&lt;p&gt;The audited repository did not trigger that floor because STEM-AI classified it as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"clinical_adjacent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ca_severity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"t0_hard_floor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It produces biological sequence artifacts that may sit near public-health or clinical workflows, but the inspected surface did not make direct autonomous diagnosis or treatment claims. It also has substantive implementation, CI, and domain-specific test definitions.&lt;/p&gt;

&lt;p&gt;So the result is not T0.&lt;/p&gt;

&lt;p&gt;But it is also not high-trust.&lt;/p&gt;

&lt;p&gt;The bounded result is T2 Caution.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3o3nr2j7trtap0med4y.png" alt="Stem-AI Audit v1.1.2" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Code-integrity findings
&lt;/h2&gt;

&lt;p&gt;The same JSON records C1-C4 LOCAL_ANALYSIS checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C1_hardcoded_credentials"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C2_dependency_pinning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C3_dead_or_deprecated_patient_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"C4_exception_handling_clinical_adjacent_paths"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"WARN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the difference between a general review and a code-path audit.&lt;/p&gt;

&lt;p&gt;A text review can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The project appears technically mature.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A code-path audit can say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Credential patterns were checked. Dependency pinning is weak. Deprecated patient-adjacent metadata exists. One clinical-adjacent filtering path does not fail closed on missing depth.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a more useful governance object.&lt;/p&gt;

&lt;p&gt;It is not a certificate.&lt;/p&gt;

&lt;p&gt;It is a map of what a reviewer should trust, distrust, or inspect next.&lt;/p&gt;




&lt;h2&gt;
  
  
  A small Python verifier
&lt;/h2&gt;

&lt;p&gt;Here is a small, dependency-free Python script that reads the actual audit JSON and verifies the score calculation. It needs neither the target's private code nor patient data; it only checks the machine-readable audit result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#!/usr/bin/env python3
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;


&lt;span class="n"&gt;RESULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;audits/fieldbioinformatics_v1_1_2/experiment_results.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;54&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;69&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;T4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;█&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;░&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;filled&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;RESULT&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1_readme_intent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2_repo_local_consistency&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3_code_bio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;risk_penalty&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;risk_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weights&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stage_3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;risk_penalty&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="n"&gt;computed&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;final_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;assert&lt;/span&gt; &lt;span class="nf"&gt;tier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 1  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 2R &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stage 3  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stage_3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Final    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/100  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;bar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;computed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Tier     &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;formal_tier&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_integrity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected digest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Stage 1   65/100  █████████████░░░░░░░
Stage 2R  75/100  ███████████████░░░░░
Stage 3   55/100  ███████████░░░░░░░░░
Final     63/100  █████████████░░░░░░░
Tier      T2 Caution
C1_hardcoded_credentials: PASS
C2_dependency_pinning: WARN
C3_dead_or_deprecated_patient_adjacent_paths: WARN
C4_exception_handling_clinical_adjacent_paths: WARN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Bio/medical AI governance is full of language that sounds safe but is hard to verify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"research use only"&lt;/li&gt;
&lt;li&gt;"not medical advice"&lt;/li&gt;
&lt;li&gt;"validated pipeline"&lt;/li&gt;
&lt;li&gt;"clinical-grade"&lt;/li&gt;
&lt;li&gt;"responsible AI"&lt;/li&gt;
&lt;li&gt;"human-in-the-loop"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those phrases are not enough.&lt;/p&gt;

&lt;p&gt;STEM-AI asks for observable structure:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;source-code reality&lt;/li&gt;
&lt;li&gt;test reality&lt;/li&gt;
&lt;li&gt;CI reality&lt;/li&gt;
&lt;li&gt;dependency reality&lt;/li&gt;
&lt;li&gt;clinical boundary reality&lt;/li&gt;
&lt;li&gt;governance artifact reality&lt;/li&gt;
&lt;li&gt;code-integrity reality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;v1.1.2 adds another layer:&lt;/p&gt;

&lt;p&gt;auditor reality.&lt;/p&gt;

&lt;p&gt;The AI auditor itself has to load a memory contract before it scores.&lt;/p&gt;

&lt;p&gt;That is what MICA is for.&lt;/p&gt;

&lt;p&gt;The final answer is T2 Caution: research reference and supervised non-clinical technical review only. No autonomous clinical decision support.&lt;/p&gt;

&lt;p&gt;Not hype.&lt;/p&gt;

&lt;p&gt;Not rejection by default.&lt;/p&gt;

&lt;p&gt;A bounded trust judgment with evidence paths.&lt;/p&gt;




&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The follow-on lane should:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;provision the target dependency environment&lt;/li&gt;
&lt;li&gt;run selected target tests in a controlled shell&lt;/li&gt;
&lt;li&gt;capture command, exit code, environment hash, and output digest&lt;/li&gt;
&lt;li&gt;attach a replay manifest to &lt;code&gt;experiment_results.json&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;keep runtime evidence separate from source/document/CI evidence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the current demonstration, runtime execution status is recorded as an evidence boundary in the audit JSON. The score itself remains based on the official v1.1.2 LOCAL_ANALYSIS evidence basis: Stage 1 source/README evidence, Stage 2R repo-local consistency, Stage 3 code/bio evidence, and C1-C4 integrity checks.&lt;/p&gt;
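&lt;p&gt;As a sketch of what one replay-manifest entry could look like, here is a minimal capture helper. The function name and field names below are illustrative assumptions for this post, not the official v1.1.2 schema.&lt;br&gt;
&lt;/p&gt;

```python
# Hypothetical replay-manifest capture for the follow-on runtime lane.
# run_and_capture and its field names are illustrative assumptions,
# not part of the official STEM-AI v1.1.2 output schema.
import hashlib
import json
import os
import subprocess
import sys


def run_and_capture(cmd):
    """Run one target test command and record replayable evidence."""
    # Hash the environment so a reviewer can detect a changed setup.
    env_hash = hashlib.sha256(
        json.dumps(sorted(os.environ.items())).encode("utf-8")
    ).hexdigest()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "command": " ".join(cmd),
        "exit_code": proc.returncode,
        "environment_hash": env_hash,
        "output_digest": hashlib.sha256(proc.stdout.encode("utf-8")).hexdigest(),
    }


# Kept under its own key so runtime evidence stays separate from
# source/document/CI evidence.
manifest = {"runtime_evidence": [run_and_capture([sys.executable, "--version"])]}
print(json.dumps(manifest, indent=2))
```

&lt;p&gt;The point is not the hashing details; it is that every runtime claim arrives with a command, an exit code, and digests a reviewer can replay.&lt;/p&gt;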




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0wo50wt3x5bfg8xrip3.png" alt="Stem-AI" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thought
&lt;/h2&gt;

&lt;p&gt;STEM-AI is &lt;strong&gt;not a clinical certifier.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It is also &lt;strong&gt;not trying to replace scientific review, regulatory review, or domain experts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Its role is narrower: &lt;strong&gt;make the governance conversation start from observable evidence instead of presentation quality.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In practice, that means asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did the repository claim?&lt;/li&gt;
&lt;li&gt;What does the code actually implement?&lt;/li&gt;
&lt;li&gt;Do the local surfaces agree with each other?&lt;/li&gt;
&lt;li&gt;Are the tests domain-specific or merely infrastructural?&lt;/li&gt;
&lt;li&gt;Are clinical-adjacent boundaries explicit?&lt;/li&gt;
&lt;li&gt;Can the auditor's own scoring logic be inspected?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is where I think STEM-AI belongs in AI governance.&lt;/p&gt;

&lt;p&gt;Not as the final authority.&lt;/p&gt;

&lt;p&gt;As the evidence gate before authority is invoked.&lt;/p&gt;

&lt;p&gt;It turns a vague question, "Do we trust this bio/medical AI repository?", into a more reviewable one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Does this repository establish enough observable trust to be considered, contained, or rejected?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>bioinformatics</category>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>ai</category>
    </item>
    <item>
      <title>Each /slop Is a Calibration Signal — AI-SLOP Detector v3.6.0 and the Claude Code Skill</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 28 Apr 2026 12:04:44 +0000</pubDate>
      <link>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</link>
      <guid>https://dev.to/flamehaven01/each-slop-is-a-calibration-signal-ai-slop-detector-v360-and-the-claude-code-skill-3909</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff2hqsg05873bhhlmli4u.png" alt="The Quiet Failure of AI Development" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;AI-assisted development has a quiet failure mode: the assistant that creates the pattern often becomes the assistant that reviews it.&lt;/p&gt;

&lt;p&gt;When you and Claude work inside the same session, you drift together. The review criteria shift with the assistant's habits. After enough sessions, the same assistant that wrote the hollow function body is also the one approving the pull request. There is no external reference point — unless you build one.&lt;/p&gt;

&lt;p&gt;That is the problem AI-SLOP Detector v3.6.0 addresses with the Claude Code skill.&lt;/p&gt;

&lt;p&gt;Every time you run &lt;code&gt;/slop&lt;/code&gt; inside a session, the scan result is recorded to a project-scoped history. When enough re-scan evidence accumulates, bounded self-calibration adjusts the detection weights for your codebase — automatically, without a manual command. The scanner does not drift with the session. It stays anchored to observed scan outcomes.&lt;/p&gt;

&lt;p&gt;It does not get smarter every time. It builds calibration signal every time. That is a more accurate claim, and the distinction matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Skill Does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu5vvu9dumj5w8b54vzd2.png" alt="The Skill layer Quality Policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; claude-skills/slop-detector ~/.claude/skills/slop-detector
&lt;span class="c"&gt;# restart Claude Code&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four slash commands become available:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Command&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Full project scan — interprets findings, prioritizes fixes, proposes patch plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-file [path]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Per-file deep-dive — explains each metric, gives concrete fix per pattern&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-gate&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hard gate decision — PASS or FAIL, lists blocking files with deficit_score &amp;gt;= 70&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;/slop-spar&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Adversarial validation — probes metric boundaries, catches calibration drift&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
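&lt;p&gt;The &lt;code&gt;/slop-gate&lt;/code&gt; rule in the table reduces to a few lines. The JSON shape below (&lt;code&gt;files&lt;/code&gt;, &lt;code&gt;deficit_score&lt;/code&gt;, &lt;code&gt;path&lt;/code&gt;) is assumed for illustration; the real slop-detector output schema may differ.&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of the /slop-gate decision rule: FAIL if any file is at or
# above the blocking threshold. JSON field names are assumptions.
import json

GATE_THRESHOLD = 70


def gate(scan_json):
    report = json.loads(scan_json)
    blocking = [
        f["path"]
        for f in report.get("files", [])
        if f.get("deficit_score", 0) >= GATE_THRESHOLD
    ]
    return {"decision": "FAIL" if blocking else "PASS", "blocking": blocking}


sample = json.dumps({"files": [
    {"path": "ddc.py", "deficit_score": 11.0},
    {"path": "legacy.py", "deficit_score": 82.5},
]})
print(gate(sample))  # FAIL: legacy.py blocks the merge
```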

&lt;p&gt;The intended workflow inside a Claude session:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. /slop               → baseline scan, identify top offenders
2. review findings     → Claude prioritizes by deficit_score
3. patch files         → fix patterns with Claude's help
4. /slop-file &amp;lt;path&amp;gt;   → verify improvement per file
5. /slop               → confirm project aggregate improved
6. /slop-gate          → gate decision before merge
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Quality policy lives in the skill layer. You do not re-explain what &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; means or which patterns are critical in every session.&lt;/p&gt;




&lt;h2&gt;
  
  
  The LEDA Flywheel
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzy5sjec49yyfddet0bqj.png" alt="The LEDA Flywheel" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the part that matters.&lt;/p&gt;

&lt;p&gt;LEDA is not model retraining. It is bounded weight calibration based on repeated scan outcomes.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;/slop&lt;/code&gt; runs &lt;code&gt;slop-detector --project . --json&lt;/code&gt; — without &lt;code&gt;--no-history&lt;/code&gt;. Every invocation auto-records results to &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;, tagged with a &lt;code&gt;project_id&lt;/code&gt; (sha256 of cwd) so signals never mix across different repositories.&lt;/p&gt;
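&lt;p&gt;For intuition, a project-scoped key like that can be derived in one line. The exact normalization (absolute path, UTF-8 encoding) is my assumption; the real tool may differ in detail.&lt;br&gt;
&lt;/p&gt;

```python
# Illustrative project_id derivation: sha256 of the working directory.
# Normalization details are assumptions, not the tool's exact code.
import hashlib
from pathlib import Path


def project_id(cwd):
    return hashlib.sha256(str(Path(cwd).resolve()).encode("utf-8")).hexdigest()


# Two different repositories always get different keys,
# so their calibration signals never mix in history.db.
print(project_id("."))
```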

&lt;p&gt;After every &lt;strong&gt;10 re-scanned files&lt;/strong&gt;, the tool runs the LEDA self-calibration loop automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/slop called
    │
    ├─► scan result → recorded to history.db (project-scoped)
    │
    ├─► 10 re-scanned files milestone?
    │       └─► SelfCalibrator: 4D grid-search over run history
    │               (ldr × inflation × ddc × purity weights)
    │               └─► confidence gap &amp;gt; 0.10?
    │                       └─► .slopconfig.yaml updated silently
    │
    └─► next /slop → calibrated weights, sharper detection
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The calibrator uses re-scanned files as signal — not raw record count. A file counts toward the milestone only when the tool has seen it improve or degrade across at least two runs. This prevents first-time project scans from triggering calibration on noise.&lt;/p&gt;
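&lt;p&gt;That milestone rule is easy to state in code. The record shape and the milestone size of 10 are taken from the description above; everything else is an illustrative sketch, not the calibrator's actual implementation.&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of the "re-scanned files" milestone: a file contributes only
# once it has at least two recorded runs, so a first-time project scan
# never triggers calibration on noise. Record shape is an assumption.
from collections import Counter

MILESTONE = 10


def rescanned_files(records):
    runs_per_file = Counter(r["path"] for r in records)
    return sum(1 for n in runs_per_file.values() if n >= 2)


def should_calibrate(records):
    return rescanned_files(records) >= MILESTONE


history = [{"path": "a.py"}, {"path": "a.py"}, {"path": "b.py"}]
print(rescanned_files(history))  # 1: only a.py has been scanned twice
```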

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1yf4jspidkr41hpvoe8.png" alt="Constrained to Reality" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three constraints keep calibration bounded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Domain-anchored&lt;/strong&gt; — grid search is constrained to ±0.15 around domain baseline weights. Detection cannot drift outside the meaningful range for your project type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence gate&lt;/strong&gt; — only applies when the top candidate weight set beats the second by &amp;gt; 0.10. Ambiguous signals produce no change.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift warnings&lt;/strong&gt; — &lt;code&gt;CalibrationResult.warnings&lt;/code&gt; flags any dimension that shifted &amp;gt; 0.25 from the anchor.&lt;/li&gt;
&lt;/ul&gt;
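&lt;p&gt;The three constraints above can be sketched as three small checks. The constants come from the list; the function names and weight keys are illustrative, not the real &lt;code&gt;SelfCalibrator&lt;/code&gt; API.&lt;br&gt;
&lt;/p&gt;

```python
# Sketch of the bounding rules: anchor clamp (0.15 band), confidence
# gate (0.10 gap), drift warnings (0.25 shift). Names are illustrative.
ANCHOR_BAND = 0.15
CONFIDENCE_GAP = 0.10
DRIFT_WARN = 0.25


def clamp_to_anchor(candidate, anchor):
    # Grid-search winners cannot leave the domain-anchored band.
    return {
        k: min(max(candidate[k], anchor[k] - ANCHOR_BAND), anchor[k] + ANCHOR_BAND)
        for k in anchor
    }


def accept(best_score, runner_up_score):
    # Ambiguous signals produce no change.
    return (best_score - runner_up_score) > CONFIDENCE_GAP


def drift_warnings(applied, anchor):
    return [k for k in anchor if abs(applied[k] - anchor[k]) > DRIFT_WARN]


anchor = {"ldr": 0.30, "inflation": 0.25, "ddc": 0.25, "purity": 0.20}
candidate = {"ldr": 0.60, "inflation": 0.20, "ddc": 0.25, "purity": 0.20}
applied = clamp_to_anchor(candidate, anchor)
print(round(applied["ldr"], 2))  # 0.45: pulled back to the anchor band
```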

&lt;p&gt;&lt;code&gt;/slop-spar&lt;/code&gt; adds a separate adversarial layer: it probes known-pattern anchors, metric boundary cases, and existence conditions. When it detects that measured behavior has diverged from metric claims, it recommends &lt;code&gt;--self-calibrate --apply-calibration&lt;/code&gt; explicitly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Data Shows — and What We Won't Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzbr2rhp3wc97g1teul2.png" alt="Workflow telemetry, not empty claims" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will not tell you that AI-SLOP Detector improves code quality by X%.&lt;/p&gt;

&lt;p&gt;We have not run a controlled study. We have not compared matched projects with and without the tool. Any number we put here would be a claim we cannot prove, and this tool is built specifically to catch that kind of thing.&lt;/p&gt;

&lt;p&gt;What we do have: the tool scanning itself. Every time a core module was changed, it got re-scanned. N = 14,367 records across all projects in &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This is not outcome evidence. It is workflow telemetry. Here is what the scan history shows for the eight most-improved files in this codebase:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;File                   Scans  Worst → Best   Improvement
─────────────────────────────────────────────────────────
ddc.py                   86   87.8 →  11.0    -76.8 pts
placeholder.py           92   70.3 →   0.0    -70.3 pts
cross_file.py            89   70.3 →   5.0    -65.3 pts
ci_gate.py               88   69.3 →   6.2    -63.1 pts
cli.py                   88   68.4 →   8.4    -60.0 pts
ldr.py                   90   58.0 →   0.1    -57.9 pts
python_advanced.py       95   74.0 →  18.0    -56.0 pts
context_jargon.py        86   55.7 →   5.0    -50.7 pts
─────────────────────────────────────────────────────────
Source: self-scan, history.db — not an independent study
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
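&lt;p&gt;The per-file worst/best aggregation in that table is easy to reproduce from any scan-history table. A minimal sketch against an in-memory stand-in, assuming a hypothetical &lt;code&gt;scans(file, score)&lt;/code&gt; schema; the real &lt;code&gt;history.db&lt;/code&gt; layout may differ:&lt;/p&gt;

```python
import sqlite3

# In-memory stand-in for the scan history. Assumed (hypothetical) schema:
# one row per scan, recording the file path and its deficit score.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (file TEXT, score REAL)")
conn.executemany("INSERT INTO scans VALUES (?, ?)", [
    ("ddc.py", 87.8), ("ddc.py", 40.2), ("ddc.py", 11.0),
    ("ldr.py", 58.0), ("ldr.py", 0.1),
])

# Per-file scan count, worst score, best score, and net improvement.
rows = conn.execute("""
    SELECT file,
           COUNT(*)                AS scans,
           MAX(score)              AS worst,
           MIN(score)              AS best,
           MIN(score) - MAX(score) AS improvement
    FROM scans
    GROUP BY file
    ORDER BY improvement ASC
""").fetchall()

for name, n, worst, best, delta in rows:
    print(f"{name:12s} {n:3d}  {worst:5.1f} -> {best:5.1f}  {delta:+6.1f} pts")
```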



&lt;p&gt;And the weekly project aggregate (avg deficit score):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Week      Avg Deficit   Critical Files   Note
────────────────────────────────────────────────────────
2026-W09     11.9            3           baseline
2026-W10     22.1           20           structural refactor spike
2026-W14     20.0           58           large feature addition
2026-W15     11.9           14           post-refactor recovery
2026-W17     12.2           13           current — stable CLEAN state
────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The mechanism is not mysterious. Scan reveals structural problems → Claude sees exact pattern names and line references → Claude (or the developer) fixes them → rescan confirms improvement → LEDA registers the delta and adjusts detection weights accordingly.&lt;/p&gt;
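&lt;p&gt;The loop itself fits in a few lines. Here &lt;code&gt;scan&lt;/code&gt;, &lt;code&gt;fix&lt;/code&gt;, and &lt;code&gt;register_delta&lt;/code&gt; are stand-ins for the real components, not the tool's API:&lt;/p&gt;

```python
def audit_loop(scan, fix, register_delta, path, target=10.0, max_rounds=5):
    """Scan -> fix -> rescan until the deficit score reaches the target.
    scan(path) returns (score, findings); fix(path, findings) applies
    repairs; register_delta(path, before, after) is the LEDA-style
    feedback step that lets detection weights follow observed outcomes."""
    score, findings = scan(path)
    for _ in range(max_rounds):
        if score <= target:
            break
        fix(path, findings)                     # repair the named patterns
        new_score, findings = scan(path)        # rescan to confirm
        register_delta(path, score, new_score)  # feed the delta back
        score = new_score
    return score
```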

&lt;p&gt;The loop does not guarantee quality. It makes quality visible, then measurable, then improvable.&lt;/p&gt;

&lt;p&gt;Whether that loop improves your codebase is something your &lt;code&gt;history.db&lt;/code&gt; will tell you — not us.&lt;/p&gt;




&lt;h2&gt;
  
  
  Also in v3.6.0
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo24mnhsdtwi84p8wzx0l.png" alt="System diagnostics &amp;amp; Protocol refinements" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI gate exit code fix.&lt;/strong&gt; &lt;code&gt;--ci-mode hard&lt;/code&gt; without &lt;code&gt;--ci-report&lt;/code&gt; was returning exit 0 even on &lt;code&gt;CRITICAL_DEFICIT&lt;/code&gt; files; the fix was two lines in &lt;code&gt;_evaluate_ci_gate()&lt;/code&gt; (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0d67997" rel="noopener noreferrer"&gt;&lt;code&gt;0d67997&lt;/code&gt;&lt;/a&gt;). The bug affected v3.1.1 through v3.5.0, but only on the specific path where the gate was used without the reporting flag. A subprocess-level regression test was added to prevent recurrence (commit &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/commit/0208af4" rel="noopener noreferrer"&gt;&lt;code&gt;0208af4&lt;/code&gt;&lt;/a&gt;).&lt;/p&gt;
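&lt;p&gt;The class of bug is worth spelling out. A sketch of the fixed decision path, with illustrative names (see the linked commit for the actual change):&lt;/p&gt;

```python
CRITICAL = "CRITICAL_DEFICIT"

def write_report(results):
    """Hypothetical stand-in for the --ci-report output step."""

def evaluate_ci_gate(results, ci_mode, ci_report=False):
    """Fixed behavior: the gate decides the exit code whether or not a
    report was requested. The pre-fix bug computed the nonzero exit only
    on the reporting path, so hard mode without --ci-report returned 0
    even when a file was CRITICAL_DEFICIT."""
    failed = any(r["status"] == CRITICAL for r in results)
    if ci_report:
        write_report(results)
    return 1 if ci_mode == "hard" and failed else 0
```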

&lt;p&gt;&lt;strong&gt;Pre-commit hooks rewritten.&lt;/strong&gt; Three hook variants now use &lt;code&gt;python -m slop_detector.cli&lt;/code&gt; as the entry point (bypassing a Windows &lt;code&gt;.exe&lt;/code&gt; wrapper exit-code issue), and the nonexistent &lt;code&gt;--severity high&lt;/code&gt; flag has been replaced with &lt;code&gt;--ci-mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;repos&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://github.com/flamehaven01/AI-SLOP-Detector&lt;/span&gt;
    &lt;span class="na"&gt;rev&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v3.6.0&lt;/span&gt;
    &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;slop-detector&lt;/span&gt;           &lt;span class="c1"&gt;# hard gate&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-warn    # report only&lt;/span&gt;
      &lt;span class="c1"&gt;# - id: slop-detector-patterns  # fast per-file&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;VS Code Extension v3.6.0.&lt;/strong&gt; Version tracks core library. No behavior changes from v3.5.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Shape of the Loop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhaifh98pgwka02u110lx.png" alt="An External reference point" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The skill + LEDA loop is the external reference point. Detection weights stay grounded in observed scan outcomes — files that improved across re-scans, files that stayed problematic — rather than in what the assistant believes is correct at any given moment.&lt;/p&gt;


&lt;p&gt;We won't tell you what percentage your code will improve. That would make us the thing we are trying to detect.&lt;/p&gt;

&lt;p&gt;The scanner is not Claude's opinion about code quality. It is a measurement that gets calibrated against reality, session by session. Your &lt;code&gt;history.db&lt;/code&gt; will tell you the rest.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y0o9wh1nmvimjajm12r.png" alt="The Shape of the Loop" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Links:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/project/ai-slop-detector/" rel="noopener noreferrer"&gt;PyPI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/CLAUDE_CODE_SKILL.md" rel="noopener noreferrer"&gt;Claude Code Skill docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/docs/SELF_CALIBRATION.md" rel="noopener noreferrer"&gt;Self-Calibration docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>claudeai</category>
      <category>codequality</category>
      <category>ai</category>
    </item>
    <item>
      <title>When an AI Pipeline Passes — But One Path Still Must Be Held: EXP-034</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 27 Apr 2026 10:09:19 +0000</pubDate>
      <link>https://dev.to/flamehaven01/when-an-ai-pipeline-passes-but-one-path-still-must-be-held-exp-034-16af</link>
      <guid>https://dev.to/flamehaven01/when-an-ai-pipeline-passes-but-one-path-still-must-be-held-exp-034-16af</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fianvzlqvv7dw9ptnldvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fianvzlqvv7dw9ptnldvx.png" alt="Cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;No efficacy, causal, or clinical claims are made in this report.&lt;/em&gt;&lt;br&gt;
RExSyn is an experimental Bio-AI governance pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You do not need to know the earlier experiments to read this report.&lt;/p&gt;

&lt;p&gt;Most AI pipeline reports ask one question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Did the system pass?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 asked a stricter one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which path was allowed to count?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That distinction matters.&lt;/p&gt;

&lt;p&gt;In a multi-stage AI pipeline, a final &lt;code&gt;PASS&lt;/code&gt; can hide a lot of unresolved risk. A branch may be unstable. A regeneration path may drift. A new external API may enter the chain without being governed. A new modality may appear to improve the system while quietly changing the basis of judgment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7hho1hqvwps2xp0m9p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgt7hho1hqvwps2xp0m9p.png" alt="The real result" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So EXP-034 was not designed to produce a clean success story.&lt;/p&gt;

&lt;p&gt;It was designed to separate three things:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;th&gt;Meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Anchored expansion path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GO&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accepted path for EXP-034 reporting&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Current regeneration path&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HOLD&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Diagnostic evidence, not acceptance baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Next remediation cycle&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EXP-035&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RCA and repair target&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the real result.&lt;/p&gt;

&lt;p&gt;EXP-034 passed, but not because every path passed.&lt;/p&gt;

&lt;p&gt;It passed because the accepted anchor remained stable, the expansion tracks did not break the judgment system, and the unresolved regeneration path was explicitly held instead of being silently mixed into acceptance.&lt;/p&gt;




&lt;h2&gt;
  
  
  What EXP-034 tested
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve4hb86wepyckegmtdfz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fve4hb86wepyckegmtdfz.png" alt="Locking the Boundary" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-033 had already established a parity baseline.&lt;/p&gt;

&lt;p&gt;EXP-034 asked whether that baseline could survive controlled expansion while adding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a modal update track,&lt;/li&gt;
&lt;li&gt;a live AlphaFold EBI observer endpoint,&lt;/li&gt;
&lt;li&gt;and AlphaGenome / AG measurement.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The operating rule was simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reproduce the parity baseline first.&lt;/li&gt;
&lt;li&gt;Only then allow expansion.&lt;/li&gt;
&lt;li&gt;Only then compare governance behavior across experiment cycles.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the parity anchor breaks, the rest is not expansion.&lt;/p&gt;

&lt;p&gt;It is regression.&lt;/p&gt;

&lt;p&gt;The scope was also locked: methodology, governance, and reproducibility only. The experiment did not claim biological efficacy, causal inference, or clinical recommendation.&lt;/p&gt;

&lt;p&gt;That boundary is important because this kind of system can easily sound more powerful than what was actually measured. EXP-034 was not asking whether the pipeline discovered a better biological answer.&lt;/p&gt;

&lt;p&gt;It was asking whether the judgment system stayed governable after new signals entered the chain.&lt;/p&gt;




&lt;h2&gt;
  
  
  The key split: PASS did not mean everything passed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsjcpcij1hext56s3s1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftsjcpcij1hext56s3s1b.png" alt="The key split" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track-A produced the defining decision of the experiment.&lt;/p&gt;

&lt;p&gt;The accepted legacy replay anchor preserved the required PASS/BLOCK separation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Legacy replay anchor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sample balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arm accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;arm balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dangerous false-pass rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;false reject rate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That was the path allowed to anchor EXP-034.&lt;/p&gt;

&lt;p&gt;But the current regeneration path did not recover:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Current regeneration&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sample accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sample balanced accuracy&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HOLD&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the most important part of the experiment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXP-034 did not pretend the regeneration path passed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It kept that result inside the experiment as diagnostic evidence, but did not allow it to redefine the accepted baseline.&lt;/p&gt;

&lt;p&gt;That separation is not a minor operational detail. It is the governance result.&lt;/p&gt;

&lt;p&gt;A weak pipeline would have blended the two paths and still reported a final success. EXP-034 did the opposite. It allowed the stable anchor to proceed and held the unstable path for RCA.&lt;/p&gt;

&lt;p&gt;That is how a stage-gated system avoids changing its own question after seeing the result.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why path splitting matters
&lt;/h2&gt;

&lt;p&gt;The concrete governance problem is this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A pipeline can pass for the wrong reason.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;valid_report = stable_anchor × traceable_extension × contained_instability
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
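&lt;p&gt;Read as boolean gates, the formula is a strict conjunction: any single failing condition invalidates the report. A minimal sketch of that rule:&lt;/p&gt;

```python
def evaluate_report(stable_anchor, traceable_extension, contained_instability):
    """The conjunction above as a checklist: the report is valid only if
    every condition holds; otherwise the failed conditions are named."""
    checks = {
        "stable_anchor": stable_anchor,
        "traceable_extension": traceable_extension,
        "contained_instability": contained_instability,
    }
    failed = [name for name, ok in checks.items() if not ok]
    return not failed, failed
```

&lt;p&gt;Returning the failed condition names, rather than a bare boolean, is what lets a report say which leg broke.&lt;/p&gt;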



&lt;p&gt;If the anchor is not stable, the report cannot be trusted.&lt;/p&gt;

&lt;p&gt;If the extension is not traceable, the new signal becomes an ungoverned side channel.&lt;/p&gt;

&lt;p&gt;If instability is not contained, a diagnostic failure can quietly contaminate acceptance.&lt;/p&gt;

&lt;p&gt;A single final &lt;code&gt;PASS&lt;/code&gt; is not enough when several branches contribute to a verdict. You need to know which branch produced the accepted decision, which branch failed, which branch was only diagnostic, and which branch is allowed to affect future work.&lt;/p&gt;

&lt;p&gt;EXP-034 passed because all three conditions were enforced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the legacy replay anchor held,&lt;/li&gt;
&lt;li&gt;the new observer and AG paths were measured under governance,&lt;/li&gt;
&lt;li&gt;and the regeneration HOLD remained outside acceptance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the difference between a pipeline that merely outputs a verdict and a pipeline that controls which verdicts are allowed to count.&lt;/p&gt;




&lt;h2&gt;
  
  
  Adding AlphaFold EBI as an observer, not a predictor
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F703iwj95lg12xjdtdgw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F703iwj95lg12xjdtdgw9.png" alt="Controlled Expansion" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Relative to EXP-033, EXP-034 added a live AlphaFold Protein Structure Database / EBI observer line.&lt;/p&gt;

&lt;p&gt;This was not promoted into a primary predictor.&lt;/p&gt;

&lt;p&gt;It was wired as an observer/reference oracle and traced into governance as &lt;code&gt;ebi_g2&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The result:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AlphaFold EBI direct endpoint for &lt;code&gt;P23219&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GO&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stage 7 observer tests&lt;/td&gt;
&lt;td&gt;&lt;code&gt;2 passed&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;ebi_g2&lt;/code&gt; governance traceability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;BLOCKED_IDP&lt;/code&gt; mapping path&lt;/td&gt;
&lt;td&gt;validated in test&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The point is not simply that an external endpoint responded.&lt;/p&gt;

&lt;p&gt;The point is that the external signal entered the system through a governed path. It was not allowed to float beside the pipeline as informal context.&lt;/p&gt;

&lt;p&gt;EXP-034 tested whether the new observer could be admitted without becoming an ungoverned side channel.&lt;/p&gt;




&lt;h2&gt;
  
  
  AG-live: non-degradation, not repair
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadd3dcaw37lha9ua1zch.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fadd3dcaw37lha9ua1zch.png" alt="AG-live: non-degradation, not repair" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Track-C tested a simple question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If AG-live enters the pipeline, does it change the final decision?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The answer was no.&lt;/p&gt;

&lt;p&gt;AG-live did enter the pipeline.&lt;/p&gt;

&lt;p&gt;The AlphaGenome field was present with:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;AG field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;source&lt;/td&gt;
&lt;td&gt;&lt;code&gt;alphagenome_api_live&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;pathogenicity_score&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.5&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;confidence&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.7143&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clinical_significance&lt;/td&gt;
&lt;td&gt;&lt;code&gt;uncertain&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are sanitized branch artifact values, not implementation code or full raw artifacts.&lt;/p&gt;

&lt;p&gt;AG-live did not change classification.&lt;/p&gt;

&lt;p&gt;Both controls remained governed by the same conservative decision boundary:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Expected&lt;/th&gt;
&lt;th&gt;Observed&lt;/th&gt;
&lt;th&gt;Interpretation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;EXP032-BLOCK-001&lt;/code&gt; negative control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK_EXPECTED&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK / ESCALATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fail-closed behavior preserved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;EXP032-PASS-001&lt;/code&gt; pass-eligible control&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS_ELIGIBLE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK / ESCALATE&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;conservative over-blocking persisted&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the key nuance.&lt;/p&gt;

&lt;p&gt;AG-live did not create a dangerous false-pass. The negative control stayed blocked.&lt;/p&gt;

&lt;p&gt;But AG-live also did not repair the current regeneration hold. The pass-eligible control still failed to recover and remained blocked under &lt;code&gt;R2_component_floor&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The governance surface moved slightly, but the verdict did not:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Earlier AG branch&lt;/th&gt;
&lt;th&gt;AG-live branch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;p_e2e&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0912&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0947&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;clinical status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BLOCK&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;rule&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R2_component_floor&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;R2_component_floor&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;So the correct conclusion is not:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AG improved the pipeline.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The correct conclusion is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;AG-live changed the measurement surface slightly, but did not change the decision boundary.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is exactly what non-degradation means here.&lt;/p&gt;

&lt;p&gt;It preserved fail-closed behavior on the negative control while leaving the pass-eligible control over-blocked.&lt;/p&gt;

&lt;p&gt;This is why Track-C can only be called non-degradation, not repair.&lt;/p&gt;




&lt;h2&gt;
  
  
  Contract passed, but governance still blocked
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclbqwjvytk3ql600icp1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fclbqwjvytk3ql600icp1.png" alt="Contract passed, but governance still blocked" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of the most useful details in EXP-034 is that the contract layer and governance layer did not collapse into one verdict.&lt;/p&gt;

&lt;p&gt;The contract inspection reported:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;pipeline contract score&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.9077&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;weakest connection&lt;/td&gt;
&lt;td&gt;&lt;code&gt;C2&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;dangerous pass risk&lt;/td&gt;
&lt;td&gt;&lt;code&gt;0.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;gate recommendation&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;overall OK&lt;/td&gt;
&lt;td&gt;&lt;code&gt;true&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;But the clinical governance layer still blocked the case.&lt;/p&gt;

&lt;p&gt;That is not a contradiction.&lt;/p&gt;

&lt;p&gt;It means the pipeline connection was valid enough to inspect, but the decision was not safe enough to accept.&lt;/p&gt;

&lt;p&gt;This distinction matters.&lt;/p&gt;

&lt;p&gt;A weaker system might treat a passing contract as permission to pass the whole output. EXP-034 did not do that. It allowed the contract layer to say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pipeline is connected.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;while the governance layer could still say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The claim should not pass.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That separation is exactly what a governance layer is supposed to preserve.&lt;/p&gt;
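&lt;p&gt;That two-layer separation can be stated directly in code. A minimal sketch, with illustrative names:&lt;/p&gt;

```python
def combined_decision(contract_ok, governance_verdict):
    """Keep the two layers distinct: the contract layer answers
    'is the pipeline connected?', the governance layer answers
    'may this claim pass?'. A passing contract is never, by itself,
    permission to pass the output."""
    return {
        "contract": "PASS" if contract_ok else "FAIL",
        "governance": governance_verdict,
        "output_allowed": contract_ok and governance_verdict == "PASS",
    }
```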




&lt;h2&gt;
  
  
  Cross-cycle comparison: EXP-032 → EXP-033 → EXP-034
&lt;/h2&gt;

&lt;p&gt;Track-D compared the accepted anchor path across cycles.&lt;/p&gt;

&lt;p&gt;You do not need the earlier experiments as background. They matter here for one reason only:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;EXP-034 was not allowed to invent a new success criterion.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;EXP-032 and EXP-033 provided the previous PASS/BLOCK baseline. EXP-034 tested whether that baseline survived expansion.&lt;/p&gt;

&lt;p&gt;The classification baseline stayed fixed:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Compare&lt;/th&gt;
&lt;th&gt;Accuracy / balanced accuracy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EXP-032 → EXP-034&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0 / 1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EXP-033 → EXP-034&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1.0 / 1.0&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At the same time, governance signals moved:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Governance signal&lt;/th&gt;
&lt;th&gt;Delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ccge_p_e2e_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+0.0445&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nnsl_sr9_tech_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;+0.0469&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;nnsl_di2_tech_mean&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;-0.0367&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The interpretation is narrow:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The judgment baseline stayed fixed while the governance surface became more measurable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is what EXP-034 was allowed to claim.&lt;/p&gt;

&lt;p&gt;It did not prove biological efficacy.&lt;/p&gt;

&lt;p&gt;It did not prove that every branch of the system was now stable.&lt;/p&gt;

&lt;p&gt;It proved that controlled expansion could happen without breaking the accepted PASS/BLOCK baseline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Stage-gate result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0jedqsq4ivhh5du6an.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyt0jedqsq4ivhh5du6an.png" alt="Cross-cycle comparison" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 ended with all five stage gates passing:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gate&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;G1 parity&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G2 reproducibility&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G3 cross-experiment compare&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G4 governance traceability&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;G5 extension safety&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
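&lt;p&gt;The &lt;code&gt;first failed gate: null&lt;/code&gt; field in the final state follows from evaluating gates in order and stopping at the first failure. A minimal sketch of that sequencing (gate names from the table above; the checks themselves are stubs):&lt;/p&gt;

```python
def run_stage_gates(gates):
    """gates: ordered (name, check) pairs. Evaluate in order and stop at
    the first failure; when every gate passes, first_failed_gate is None
    (serialized as null)."""
    for name, check in gates:
        if not check():
            return "FAIL", name
    return "PASS", None

# EXP-034's five gates, with every check stubbed to pass:
gates = [(name, lambda: True) for name in [
    "G1 parity", "G2 reproducibility", "G3 cross-experiment compare",
    "G4 governance traceability", "G5 extension safety",
]]
status, first_failed = run_stage_gates(gates)
```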

&lt;p&gt;Final state:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;overall status&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PASS&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;anchor mode&lt;/td&gt;
&lt;td&gt;&lt;code&gt;legacy_replay&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;first failed gate&lt;/td&gt;
&lt;td&gt;&lt;code&gt;null&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;diagnostic hold&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Track-A current regeneration&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This is the important nuance:&lt;/p&gt;

&lt;p&gt;The experiment passed with a retained diagnostic hold.&lt;/p&gt;

&lt;p&gt;That is not a contradiction. It is the point of the control system.&lt;/p&gt;

&lt;p&gt;The accepted anchor path was allowed to proceed. The current regeneration path was not. The remediation target was moved to EXP-035.&lt;/p&gt;

&lt;p&gt;That separation is the actual proof EXP-034 provides: not that every branch became stable, but that instability was not allowed to contaminate acceptance.&lt;/p&gt;
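&lt;p&gt;The separation can be sketched in a few lines of logic: gates decide acceptance, holds are carried alongside it without flipping the verdict. This is a minimal illustration of the pattern, not the pipeline's actual code; every name below is hypothetical.&lt;/p&gt;

```python
# Minimal sketch of a stage-gate verdict that keeps diagnostic holds
# separate from acceptance. All names here are illustrative.

def evaluate_gates(gates, holds):
    """Gates decide PASS/BLOCK; holds are tracked but never
    mixed into the accepted result."""
    first_failed = next((name for name, ok in gates if not ok), None)
    return {
        "overall": "PASS" if first_failed is None else "BLOCK",
        "first_failed_gate": first_failed,   # null-equivalent when all gates pass
        "diagnostic_holds": list(holds),     # retained, not folded into acceptance
    }

verdict = evaluate_gates(
    gates=[("G1", True), ("G2", True), ("G3", True), ("G4", True), ("G5", True)],
    holds=["Track-A current regeneration"],
)
# A verdict can be PASS while a diagnostic hold remains open.
```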




&lt;h2&gt;
  
  
  What EXP-034 actually showed
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69c63nfb2kt1gso8awp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F69c63nfb2kt1gso8awp2.png" alt="What EXP-034 actually showed" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;EXP-034 did not show that the entire pipeline is now stable.&lt;/p&gt;

&lt;p&gt;It showed something narrower and more useful:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A method-locked Bio-AI governance pipeline can admit modal expansion, AlphaFold EBI observer wiring, and AG-live measurement without losing its accepted PASS/BLOCK baseline — while keeping the unstable regeneration path out of acceptance.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Track-C sharpened that conclusion.&lt;/p&gt;

&lt;p&gt;AG-live entered.&lt;br&gt;&lt;br&gt;
Metrics moved slightly.&lt;br&gt;&lt;br&gt;
The verdict did not change.&lt;br&gt;&lt;br&gt;
Dangerous false-pass did not appear.&lt;br&gt;&lt;br&gt;
Conservative over-blocking remained.&lt;/p&gt;

&lt;p&gt;That is not a clean success story.&lt;/p&gt;

&lt;p&gt;It is a governed result.&lt;/p&gt;


&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rj4k1d45y6vyolfgksu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0rj4k1d45y6vyolfgksu.png" alt="The Mark of a Mature AI Pipeline" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Stage-gated experimentation is not just about getting a result.&lt;/p&gt;

&lt;p&gt;It is about deciding whether the result should be allowed to exist.&lt;/p&gt;

&lt;p&gt;In EXP-034, the answer was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GO   for the anchored expansion path
HOLD for current regeneration
NEXT for EXP-035 remediation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That may sound less dramatic than a clean success story.&lt;/p&gt;

&lt;p&gt;But in governance work, that is exactly the point.&lt;/p&gt;

&lt;p&gt;A mature AI pipeline is not the one that claims everything passed.&lt;/p&gt;

&lt;p&gt;It is the one that can say:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This path passed.&lt;br&gt;&lt;br&gt;
This path did not.&lt;br&gt;&lt;br&gt;
And we did not mix them.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>reproducibility</category>
      <category>governance</category>
      <category>ai</category>
    </item>
    <item>
      <title>FLAMEHAVEN FileSearch: Why This RAG Engine Feels Different from the Usual Stack</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 20 Apr 2026 14:27:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/flamehaven-filesearch-why-this-rag-engine-feels-different-from-the-usual-stack-e83</link>
      <guid>https://dev.to/flamehaven01/flamehaven-filesearch-why-this-rag-engine-feels-different-from-the-usual-stack-e83</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa8waxadewljs6a47aqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqa8waxadewljs6a47aqw.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  FLAMEHAVEN FileSearch: Why This RAG Engine Feels Different from the Usual Stack
&lt;/h2&gt;

&lt;p&gt;RAG is no longer an exotic idea.&lt;/p&gt;

&lt;p&gt;At this point, most developers have seen the familiar stack:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;parser&lt;/li&gt;
&lt;li&gt;chunker&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;vector store&lt;/li&gt;
&lt;li&gt;LLM&lt;/li&gt;
&lt;li&gt;framework wrapper&lt;/li&gt;
&lt;li&gt;demo query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not the interesting part anymore.&lt;/p&gt;

&lt;p&gt;The interesting part is what happens after the diagram:&lt;br&gt;
how much infrastructure the stack quietly demands, how much of the retrieval path is actually auditable, how much of the system is still mechanical rather than opaque, and how much operational tax the user is forced to absorb just to get a search engine running.&lt;/p&gt;

&lt;p&gt;That is where &lt;strong&gt;FLAMEHAVEN FileSearch&lt;/strong&gt; gets more interesting than the usual "another RAG repo" framing.&lt;/p&gt;

&lt;p&gt;This is not a feature announcement. It is a technical look at what the project is actually doing differently.&lt;/p&gt;


&lt;h2&gt;
  
  
  The real problem with many RAG stacks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo70b5r5bwbjlsh76s0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqyo70b5r5bwbjlsh76s0.png" alt="Most RAG systems are assembly instructions" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of RAG systems are not products. They are assembly instructions.&lt;/p&gt;

&lt;p&gt;They give you flexibility, but they also leave you responsible for stitching together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file parsing&lt;/li&gt;
&lt;li&gt;chunking strategy&lt;/li&gt;
&lt;li&gt;embeddings&lt;/li&gt;
&lt;li&gt;lexical retrieval&lt;/li&gt;
&lt;li&gt;semantic retrieval&lt;/li&gt;
&lt;li&gt;answer generation&lt;/li&gt;
&lt;li&gt;attribution&lt;/li&gt;
&lt;li&gt;storage&lt;/li&gt;
&lt;li&gt;auth&lt;/li&gt;
&lt;li&gt;monitoring&lt;/li&gt;
&lt;li&gt;caching&lt;/li&gt;
&lt;li&gt;deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is fine if you want a blank canvas.&lt;/p&gt;

&lt;p&gt;It is less fine if what you actually want is a document search engine that can be deployed without turning the setup itself into a second project.&lt;/p&gt;

&lt;p&gt;That is the first reason this repo feels different: it is trying to compress more of that surface area into one codebase.&lt;/p&gt;


&lt;h2&gt;
  
  
  What is technically different here
&lt;/h2&gt;
&lt;h2&gt;
  
  
  1) Hybrid retrieval is treated as the baseline, not the upgrade path
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnojxzw992pkl5t22sme2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnojxzw992pkl5t22sme2.png" alt="compressing the stack" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of RAG repos still behave as if semantic retrieval is the main event and lexical matching is an optional add-on.&lt;/p&gt;

&lt;p&gt;That is backwards for real document systems.&lt;/p&gt;

&lt;p&gt;FLAMEHAVEN FileSearch builds around three explicit modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;keyword&lt;/li&gt;
&lt;li&gt;semantic&lt;/li&gt;
&lt;li&gt;hybrid&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The interesting part is the hybrid path itself.&lt;/p&gt;

&lt;p&gt;The retrieval stack combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;BM25&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Reciprocal Rank Fusion (RRF)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;Korean + English tokenizer&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;a &lt;strong&gt;lazy per-store BM25 rebuild path&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That last point matters more than it sounds. The BM25 index is not eagerly rebuilt on every upload. It is marked dirty (&lt;code&gt;_bm25_dirty&lt;/code&gt;) and rebuilt on first hybrid search after mutation. That is a very practical decision. It keeps ingestion cheaper without pretending indexing is free.&lt;/p&gt;
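&lt;p&gt;The dirty-flag pattern is simple enough to sketch. This is an illustrative reconstruction, not the repo's actual class; only the &lt;code&gt;_bm25_dirty&lt;/code&gt; flag name comes from the description above.&lt;/p&gt;

```python
# Sketch of lazy index rebuilding: mutations only mark the index dirty;
# the rebuild cost is paid once, on the first search after a change.
# Class and method names are illustrative, not the repo's API.

class LazyBM25Store:
    def __init__(self):
        self.docs = []
        self._bm25_dirty = True
        self.rebuild_count = 0

    def add(self, doc):
        self.docs.append(doc)
        self._bm25_dirty = True      # cheap: no index work at ingest time

    def search(self, query):
        if self._bm25_dirty:
            self._rebuild()          # amortized: one rebuild per burst of uploads
        return [d for d in self.docs if query in d]

    def _rebuild(self):
        self.rebuild_count += 1      # stand-in for building the real BM25 index
        self._bm25_dirty = False

store = LazyBM25Store()
store.add("alpha fold report")
store.add("bm25 ranking notes")
store.search("bm25")                 # triggers exactly one rebuild for both uploads
```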

&lt;p&gt;This is one of the deeper differences from many vector-first RAG demos: the system does not assume semantic retrieval should dominate exact-match behavior. It assumes production search needs both.&lt;/p&gt;
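&lt;p&gt;The fusion step itself is standard. Here is a minimal Reciprocal Rank Fusion implementation, assuming the conventional &lt;code&gt;k=60&lt;/code&gt; constant (the repo's exact parameters are not shown in this post):&lt;/p&gt;

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked result lists with Reciprocal Rank Fusion:
    each list contributes 1 / (k + rank) per document, so agreement
    across BM25 and semantic rankings is rewarded without ever
    comparing their raw scores directly."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]        # lexical ranking
semantic_hits = ["doc1", "doc9", "doc3"]    # vector ranking
fused = rrf_fuse([bm25_hits, semantic_hits])
```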


&lt;h2&gt;
  
  
  2) The indexing model is not just "document in, chunks out"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qk2weat5bt8415ovnzc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qk2weat5bt8415ovnzc.png" alt="The KnowledgeAtom hierarchy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The second meaningful difference is the indexing granularity.&lt;/p&gt;

&lt;p&gt;This repo introduces a &lt;strong&gt;KnowledgeAtom&lt;/strong&gt; layer: a two-level indexing model with&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file-level documents&lt;/li&gt;
&lt;li&gt;chunk-level atoms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those chunk atoms are not anonymous fragments. They carry stable fragment URIs of the form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;local://store/encoded_path#c0001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That design solves two very common problems at once:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;precision retrieval&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;stable attribution&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The file-level object remains available, but the system can also retrieve chunk-level units directly. That reduces the usual gap between "the document matched" and "the relevant passage was actually isolated."&lt;/p&gt;

&lt;p&gt;The URI choice matters too. A lot of local-first search code still uses basename-style references that collide the moment two files share a name. This repo moves to a reversible, quoted absolute-path-based URI namespace (&lt;code&gt;urllib.parse.quote(abs_path, safe='')&lt;/code&gt;), which is much less fragile.&lt;/p&gt;
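&lt;p&gt;A minimal sketch of such a reversible URI scheme, assuming hypothetical helper names (only the quoting call matches the one cited above):&lt;/p&gt;

```python
# Sketch of a reversible fragment-URI scheme. The helper names are
# illustrative; the quote(abs_path, safe='') call is the one the post cites.
from urllib.parse import quote, unquote

def make_fragment_uri(store, abs_path, chunk_index):
    encoded = quote(abs_path, safe="")   # reversible, collision-free path encoding
    return f"local://{store}/{encoded}#c{chunk_index:04d}"

def parse_fragment_uri(uri):
    prefix, _, rest = uri.partition("local://")
    store, _, tail = rest.partition("/")
    encoded, _, frag = tail.partition("#")
    return store, unquote(encoded), int(frag.lstrip("c"))

uri = make_fragment_uri("store", "/data/reports/q3.pdf", 1)
# Two files named q3.pdf in different directories now get distinct URIs,
# and the original path is recoverable from the URI alone.
```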

&lt;p&gt;That is not marketing polish. That is retrieval hygiene.&lt;/p&gt;




&lt;h2&gt;
  
  
  3) The chunking path is internal, structured, and mechanical
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F383b95htzfjgqjf66bt7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F383b95htzfjgqjf66bt7.png" alt="Internal two-pass chunking" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Another place where this codebase differs is that it does not outsource the core text pipeline by default.&lt;/p&gt;

&lt;p&gt;Instead of treating chunking as a thin wrapper around an external library, it implements an internal text chunker with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;heading-boundary splitting&lt;/li&gt;
&lt;li&gt;paragraph splitting&lt;/li&gt;
&lt;li&gt;sentence fallback for oversized blocks&lt;/li&gt;
&lt;li&gt;undersized chunk merging (default minimum: 64 tokens)&lt;/li&gt;
&lt;li&gt;token-aware chunk sizing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The chunking system is actually two-pass under the hood. The structure-aware &lt;code&gt;TextChunker&lt;/code&gt; handles the document splits above. On top of that, &lt;code&gt;KnowledgeAtom&lt;/code&gt; applies a second windowing pass when generating chunk embeddings — 800-character windows, 120-character overlap, and an 80-character minimum before a fragment is dropped. These two paths are separate by design: &lt;code&gt;TextChunker&lt;/code&gt; is responsible for semantic structure, &lt;code&gt;KnowledgeAtom&lt;/code&gt; for granular embedding units.&lt;/p&gt;
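&lt;p&gt;The second windowing pass can be sketched directly from those constants. The function name is hypothetical; the 800/120/80 values are the ones stated above:&lt;/p&gt;

```python
# Sketch of the char-window embedding pass: fixed windows with overlap,
# dropping fragments below a minimum size. Constants match the post;
# the function name is illustrative.

def window_chunks(text, size=800, overlap=120, min_len=80):
    step = size - overlap
    windows = []
    for start in range(0, max(len(text), 1), step):
        fragment = text[start:start + size]
        if len(fragment) >= min_len:   # drop undersized tail fragments
            windows.append(fragment)
    return windows
```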

&lt;p&gt;The engine also ships a &lt;code&gt;ContextExtractor&lt;/code&gt; — a sliding-window utility that can enrich each chunk with text from its neighboring chunks before retrieval. It is fully tested, but it is not yet wired into the default ingestion path. It is available for downstream pipeline extension.&lt;/p&gt;

&lt;p&gt;So the pipeline architecture is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;document
  → structure-aware split (TextChunker)
  → chunk atom embedding (KnowledgeAtom, 800-char windows)
  → multi-level indexing
  → retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That is a better-shaped pipeline for document search than a naive chunk list.&lt;/p&gt;




&lt;h2&gt;
  
  
  4) The vector path is trying to remove operational weight, not add it
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz03a8czoace2fhqm5t6o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz03a8czoace2fhqm5t6o.png" alt="zero-dependency vectorization" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is probably the most unusual architectural choice in the repo.&lt;/p&gt;

&lt;p&gt;Instead of anchoring everything around a heavyweight embedding model stack, the project uses &lt;strong&gt;Gravitas Vectorizer v2.0&lt;/strong&gt;, a deterministic vectorization path built on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hybrid feature extraction (word tokens + character n-grams)&lt;/li&gt;
&lt;li&gt;signed feature hashing for collision mitigation&lt;/li&gt;
&lt;li&gt;SHA-256 based deterministic output&lt;/li&gt;
&lt;li&gt;no torch, no transformers, no model download&lt;/li&gt;
&lt;/ul&gt;
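&lt;p&gt;To make the idea concrete, here is a minimal hashed-vectorization sketch in the same spirit: word tokens plus character n-grams, signed hashing derived from a SHA-256 digest, pure Python. This illustrates the technique, not the Gravitas implementation itself:&lt;/p&gt;

```python
# Deterministic hashed vectorization sketch: no model, no download,
# same input always produces the same unit-length vector.
import hashlib
import math

def features(text, ngram=3):
    words = text.lower().split()
    grams = [text[i:i + ngram] for i in range(len(text) - ngram + 1)]
    return words + grams

def hashed_vector(text, dim=256):
    vec = [0.0] * dim
    for feat in features(text):
        digest = hashlib.sha256(feat.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        sign = 1.0 if digest[4] % 2 == 0 else -1.0   # signed hashing mitigates collisions
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]
```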

&lt;p&gt;The trade-off is obvious: this is not trying to win a leaderboard as a giant foundation-model embedding backend.&lt;/p&gt;

&lt;p&gt;That is not the point.&lt;/p&gt;

&lt;p&gt;The point is that it makes the semantic path much cheaper to deploy, easier to reason about, and viable in environments where "just load another model" is operationally the wrong answer.&lt;/p&gt;

&lt;p&gt;Technically, that shows up in several ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;deterministic vector generation&lt;/li&gt;
&lt;li&gt;cold start under 1ms&lt;/li&gt;
&lt;li&gt;no ML framework dependency in the core vector path&lt;/li&gt;
&lt;li&gt;optional NumPy acceleration with pure-Python fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, the semantic layer is being treated as infrastructure, not as a permanent excuse to expand infrastructure.&lt;/p&gt;

&lt;p&gt;That is rare.&lt;/p&gt;




&lt;h2&gt;
  
  
  5) The repo is explicit about local-first and multi-provider execution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a0kn73rqdv8bv1fja40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7a0kn73rqdv8bv1fja40.png" alt="Architecture and provider abstraction" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A lot of document search systems quietly assume one provider path.&lt;/p&gt;

&lt;p&gt;This repo does not.&lt;/p&gt;

&lt;p&gt;The provider layer supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini&lt;/li&gt;
&lt;li&gt;OpenAI&lt;/li&gt;
&lt;li&gt;Anthropic&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;OpenAI-compatible endpoints&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That matters for two reasons.&lt;/p&gt;

&lt;p&gt;First, it keeps the system from being hardwired to one hosted model assumption.&lt;/p&gt;

&lt;p&gt;Second, it means the retrieval stack and the answer stack are not collapsed into the same dependency decision.&lt;/p&gt;

&lt;p&gt;That is an important architectural separation.&lt;/p&gt;

&lt;p&gt;For non-Gemini providers, the code takes a provider-RAG route: local semantic retrieval first, then prompt construction, then model answer generation. That is a much more honest design than pretending all providers support the same retrieval semantics natively.&lt;/p&gt;
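&lt;p&gt;That route reduces to a small contract: retrieval and attribution stay local, and only prompt-to-text generation is delegated to the provider. A hypothetical sketch (none of these names are the repo's API):&lt;/p&gt;

```python
# Provider-RAG route sketch: retrieve locally, build the prompt,
# hand only generation to whichever provider is configured.

def answer(query, retrieve, generate, top_k=3):
    chunks = retrieve(query)[:top_k]              # local semantic/lexical retrieval
    context = "\n\n".join(c["text"] for c in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return {
        "answer": generate(prompt),               # provider-specific call
        "sources": [c["uri"] for c in chunks],    # attribution survives a provider swap
    }

# Any provider that maps prompt to text plugs in:
result = answer(
    "what is RRF?",
    retrieve=lambda q: [{"text": "RRF fuses ranked lists.", "uri": "local://s/doc#c0001"}],
    generate=lambda prompt: "RRF fuses ranked lists.",
)
```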

&lt;p&gt;The local Ollama path is especially relevant. Not because "local" is fashionable, but because self-hosted document search is often most attractive precisely when data boundary control matters more than marginal model quality gains.&lt;/p&gt;




&lt;h2&gt;
  
  
  6) The codebase has been refactored toward narrower responsibilities
&lt;/h2&gt;

&lt;p&gt;One of the easiest ways to tell whether a repo is becoming more operationally serious is to look at whether the core orchestrator is shrinking or swelling.&lt;/p&gt;

&lt;p&gt;Here, the architecture moved in the right direction.&lt;/p&gt;

&lt;p&gt;The central &lt;code&gt;core.py&lt;/code&gt; was split into focused mixins:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;IngestMixin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;LocalSearchMixin&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CloudSearchMixin&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is not just aesthetic cleanup.&lt;/p&gt;

&lt;p&gt;It clarifies the system boundary between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ingestion&lt;/li&gt;
&lt;li&gt;local retrieval/orchestration&lt;/li&gt;
&lt;li&gt;provider-backed answer generation&lt;/li&gt;
&lt;/ul&gt;
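&lt;p&gt;In miniature, the composition looks like this (the class names follow the post; the method bodies are purely illustrative):&lt;/p&gt;

```python
# Mixin-composition sketch: each concern lives in its own class,
# and the engine composes them. Method bodies are stand-ins.

class IngestMixin:
    def ingest(self, doc):
        self.docs.append(doc)

class LocalSearchMixin:
    def local_search(self, query):
        return [d for d in self.docs if query in d]

class CloudSearchMixin:
    def cloud_search(self, query, generate):
        context = self.local_search(query)        # reuses the local boundary
        return generate(f"Context: {context} Q: {query}")

class FileSearchEngine(IngestMixin, LocalSearchMixin, CloudSearchMixin):
    def __init__(self):
        self.docs = []
```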

&lt;p&gt;The same pattern appears elsewhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;BackendRegistry&lt;/code&gt; maps file extensions to parser classes via &lt;code&gt;register()&lt;/code&gt; — new formats plug in without modifying existing dispatch logic&lt;/li&gt;
&lt;li&gt;duplicate helper blocks were pulled out of cloud search paths&lt;/li&gt;
&lt;li&gt;file parsing was reduced to dispatch instead of a single giant extractor module&lt;/li&gt;
&lt;/ul&gt;
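&lt;p&gt;The registry pattern is worth a small sketch. Only the class name and &lt;code&gt;register()&lt;/code&gt; come from the post; the parser and dispatch details are illustrative:&lt;/p&gt;

```python
# Extension-to-parser registry sketch: adding a format is one register()
# call, with no edits to existing dispatch logic.

class BackendRegistry:
    def __init__(self):
        self._parsers = {}

    def register(self, extension, parser_cls):
        self._parsers[extension] = parser_cls

    def parse(self, filename, data):
        ext = filename.rsplit(".", 1)[-1].lower()
        parser_cls = self._parsers.get(ext)
        if parser_cls is None:
            raise ValueError(f"no parser registered for .{ext}")
        return parser_cls().parse(data)

class TextParser:
    def parse(self, data):
        return data.decode("utf-8")

registry = BackendRegistry()
registry.register("txt", TextParser)   # a new format is one register() call
```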

&lt;p&gt;These changes do not make a flashy screenshot.&lt;/p&gt;

&lt;p&gt;They do make the code easier to maintain without quietly reintroducing the same complexity elsewhere.&lt;/p&gt;

&lt;p&gt;That is a real engineering improvement.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark snapshot
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh872w7a8wjwp9zw8p060.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh872w7a8wjwp9zw8p060.png" alt="Benchmark snapshot" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;System profile&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gravitas Vectorizer v2.0 (deterministic DSP, zero ML deps)&lt;/li&gt;
&lt;li&gt;ChronosGrid vector backend with quantized storage (int8)&lt;/li&gt;
&lt;li&gt;BM25 + RRF hybrid retrieval&lt;/li&gt;
&lt;li&gt;Local / pgvector backends&lt;/li&gt;
&lt;li&gt;Redis cache optional&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Documented performance figures&lt;/strong&gt; (Docker, Apple M1, 500 PDFs ~2GB)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Vector generation: &lt;code&gt;&amp;lt;1ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search, cache hit: &lt;code&gt;9ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search, cache miss (includes Gemini API round-trip): &lt;code&gt;1,250ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Batch search (10 queries, parallel): &lt;code&gt;2,500ms&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Upload, 50MB file with indexing: &lt;code&gt;3,200ms&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What matters more than the numbers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The cache-hit figure reflects the full path when semantic and lexical retrieval are served from warm indexes.&lt;/p&gt;

&lt;p&gt;The cache-miss figure is dominated by the Gemini API round-trip, not local retrieval.&lt;/p&gt;

&lt;p&gt;The performance story here is not just raw speed. It is that the repo achieves low-latency local retrieval by reducing dependency weight and simplifying the vector path, rather than by hiding heavy infrastructure behind abstraction.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  A comparison that is actually worth making
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjehh80hs06t3tvd3ffz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjehh80hs06t3tvd3ffz.png" alt="A comparison that is actually worth making" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The wrong comparison is:&lt;/p&gt;

&lt;p&gt;"Is this the best RAG framework?"&lt;/p&gt;

&lt;p&gt;That is too vague to be useful.&lt;/p&gt;

&lt;p&gt;The better comparison is architectural.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Main idea&lt;/th&gt;
&lt;th&gt;Common weakness&lt;/th&gt;
&lt;th&gt;Why this repo differs&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Framework-only RAG stack&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Compose your own parser, retriever, vector store, and generator&lt;/td&gt;
&lt;td&gt;High assembly burden; a lot of operational logic is still your job&lt;/td&gt;
&lt;td&gt;This repo packages more of the retrieval, ingestion, attribution, and serving path together&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Hosted RAG / SaaS search&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fastest time to first demo&lt;/td&gt;
&lt;td&gt;External data boundary, vendor coupling, recurring service assumptions&lt;/td&gt;
&lt;td&gt;This repo keeps self-hosted and local-first execution as first-class options&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Vector-first DIY pipeline&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Semantic retrieval drives everything&lt;/td&gt;
&lt;td&gt;Lexical exactness and attribution often become second-class&lt;/td&gt;
&lt;td&gt;This repo treats hybrid retrieval as the practical default&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FLAMEHAVEN FileSearch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Retrieval + ingestion + serving compressed into one engine&lt;/td&gt;
&lt;td&gt;Less of a blank canvas than a raw framework stack&lt;/td&gt;
&lt;td&gt;Better fit for teams that want a mechanical, deployable search base instead of another assembly project&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That is the actual niche.&lt;/p&gt;

&lt;p&gt;Not "RAG but louder."&lt;/p&gt;

&lt;p&gt;More like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG with a lower operational tax.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;The RAG field has cooled compared to its peak hype cycle.&lt;/p&gt;

&lt;p&gt;That is not a bad thing.&lt;/p&gt;

&lt;p&gt;It means the novelty premium is lower, and the real questions are clearer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Can it be deployed?&lt;/li&gt;
&lt;li&gt;Can it run without a side quest in infrastructure?&lt;/li&gt;
&lt;li&gt;Can it keep data local?&lt;/li&gt;
&lt;li&gt;Can it support both lexical precision and semantic recall?&lt;/li&gt;
&lt;li&gt;Can its retrieval behavior be inspected rather than mythologized?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why a repo like this becomes more interesting now than it would have been in the most hype-saturated phase of the RAG wave.&lt;/p&gt;

&lt;p&gt;When everything is new, wrappers are enough.&lt;/p&gt;

&lt;p&gt;When the field matures, the differentiator becomes whether the system removes real engineering burden.&lt;/p&gt;

&lt;p&gt;This one is at least trying to solve that problem directly.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is special about the code, specifically
&lt;/h2&gt;

&lt;p&gt;If I had to reduce the repo's technical distinctiveness to a short list, it would be this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;BM25 + RRF is built in&lt;/strong&gt;, not bolted on later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;KnowledgeAtom indexing&lt;/strong&gt; gives the system a more precise retrieval unit than document-only search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable chunk URIs&lt;/strong&gt; (&lt;code&gt;local://store/enc_path#c0001&lt;/code&gt;) make attribution less fragile&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Two-pass chunking&lt;/strong&gt; — structure-aware TextChunker + char-window KnowledgeAtom embedding pass — keeps the text pipeline mechanical and inspectable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gravitas Vectorizer v2.0&lt;/strong&gt; reduces startup cost and dependency sprawl (zero torch/transformers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Provider abstraction&lt;/strong&gt; separates retrieval architecture from model vendor choice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixin segmentation and BackendRegistry pattern&lt;/strong&gt; show a codebase moving away from monolithic orchestration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is why this repo feels different from the usual RAG stack.&lt;/p&gt;

&lt;p&gt;Not because it claims magic.&lt;/p&gt;

&lt;p&gt;Because it makes several practical decisions that many RAG repos defer, externalize, or ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest boundary
&lt;/h2&gt;

&lt;p&gt;This is not a claim that the repo solves everything.&lt;/p&gt;

&lt;p&gt;It does not.&lt;/p&gt;

&lt;p&gt;And the codebase itself shows that.&lt;/p&gt;

&lt;p&gt;Static inspection still flags complexity hotspots in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;api.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;admin_routes.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;eval_self.py&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;chronos_grid.py&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are also components that exist in the engine but are not yet connected to the default pipeline — &lt;code&gt;ContextExtractor&lt;/code&gt; being the clearest example. The architecture is there; the wiring is not yet complete everywhere.&lt;/p&gt;

&lt;p&gt;That is actually a good thing for a write-up like this, because it keeps the claim honest.&lt;/p&gt;

&lt;p&gt;The interesting story here is not "perfect codebase."&lt;/p&gt;

&lt;p&gt;It is:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;a repo with a real architectural point of view, a recognizably lower dependency burden, and code decisions that are meaningfully different from the usual vector-wrapper pattern.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is a much stronger claim than vague "enterprise-grade RAG" language.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final take
&lt;/h2&gt;

&lt;p&gt;FLAMEHAVEN FileSearch is interesting because it is not merely trying to make retrieval work.&lt;/p&gt;

&lt;p&gt;It is trying to make retrieval:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more mechanical&lt;/li&gt;
&lt;li&gt;more local&lt;/li&gt;
&lt;li&gt;more attributable&lt;/li&gt;
&lt;li&gt;less dependency-heavy&lt;/li&gt;
&lt;li&gt;and less painful to deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is a better differentiator than "supports RAG."&lt;/p&gt;

&lt;p&gt;Most repositories do.&lt;/p&gt;

&lt;p&gt;The more important question now is whether they reduce the actual engineering burden around RAG, or just rearrange it.&lt;/p&gt;

&lt;p&gt;This repo is interesting because it appears to reduce some of it in code.&lt;/p&gt;

&lt;p&gt;And in a field where many projects now converge into the same parser + vector store + model + wrapper pattern, that is a difference worth paying attention to.&lt;/p&gt;




&lt;h2&gt;
  
  
  Repository
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/Flamehaven-Filesearch" rel="noopener noreferrer"&gt;https://github.com/flamehaven01/Flamehaven-Filesearch&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>rag</category>
      <category>opensource</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI-SLOP Detector v3.5.0 — Every Claim, Verified Against Source Code</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 15 Apr 2026 06:19:37 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v350-every-claim-verified-against-source-code-1n94</guid>
      <description>&lt;p&gt;I published a LinkedIn post about AI-SLOP Detector's self-calibration system and download numbers. Someone asked the reasonable question: &lt;strong&gt;"Can you actually back that up?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Yes. Here's the source.&lt;/p&gt;

&lt;p&gt;This isn't a feature announcement. It's a line-by-line audit of seven claims against the actual codebase. Every VERDICT links to a real file and real line numbers. The repo is public — go check it yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  What was claimed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Claim&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Every scan is recorded&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Repeat scans become calibration signal&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Updates only when signal is strong enough&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Visible policy artifact (&lt;code&gt;.slopconfig.yaml&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explicit numeric limits govern calibration&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Detects empty/stub/phantom/disconnected code&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~1.4K downloads last week&lt;/td&gt;
&lt;td&gt;✅ TRUE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All seven. No fabrications. No inflated numbers. Here's the proof.&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 1: "Every scan is recorded"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 116–180&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;record&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_commit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;git_branch&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-invoked on every CLI run. The only opt-out is &lt;code&gt;--no-history&lt;/code&gt;. Each scan writes to SQLite at &lt;code&gt;~/.slop-detector/history.db&lt;/code&gt; and stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;deficit_score&lt;/code&gt;, &lt;code&gt;ldr_score&lt;/code&gt;, &lt;code&gt;inflation_score&lt;/code&gt;, &lt;code&gt;ddc_usage_ratio&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n_critical_patterns&lt;/code&gt;, &lt;code&gt;fired_rules&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;git_commit&lt;/code&gt;, &lt;code&gt;git_branch&lt;/code&gt;, &lt;code&gt;project_id&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The schema is now at v5 and auto-migrates on startup; migration paths have shipped in every release from v2.9.0 to v3.5.0.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The record() call is real. The schema is versioned. The behavior is not optional.&lt;/strong&gt;&lt;/p&gt;
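&lt;p&gt;To make the recording contract concrete, here is a minimal, self-contained sketch. The table layout and function signature are illustrative only, not the actual schema in &lt;code&gt;history.py&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

# Minimal stand-in for the history recorder described above.
# Column set is a small subset of the real schema; names are illustrative.
def record_scan(db, file_path, deficit_score, git_commit=None, project_id=None):
    db.execute(
        "CREATE TABLE IF NOT EXISTS history ("
        "file_path TEXT, deficit_score REAL, git_commit TEXT, project_id TEXT, "
        "ts TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    db.execute(
        "INSERT INTO history (file_path, deficit_score, git_commit, project_id) "
        "VALUES (?, ?, ?, ?)",
        (file_path, deficit_score, git_commit, project_id),
    )
    db.commit()

db = sqlite3.connect(":memory:")  # in-memory stand-in for ~/.slop-detector/history.db
record_scan(db, "src/app.py", 0.42, git_commit="abc123")
print(db.execute("SELECT COUNT(*) FROM history").fetchone()[0])  # 1
```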




&lt;h2&gt;
  
  
  Claim 2: "Every re-scan becomes signal"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/history.py&lt;/code&gt;, lines 221–246&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;count_files_with_multiple_runs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Only files scanned &amp;gt;= 2 times count as calibration events
&lt;/span&gt;    &lt;span class="n"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="n"&gt;GROUP&lt;/span&gt; &lt;span class="n"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="n"&gt;HAVING&lt;/span&gt; &lt;span class="nc"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 301–309&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_extract_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_load_history&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;project_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;by_file&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_group_runs_by_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Single-scan files produce no calibration events. Only repeat scans generate &lt;code&gt;improvement&lt;/code&gt; or &lt;code&gt;fp_candidate&lt;/code&gt; labels. The threshold is hardcoded in SQL, not assumed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The repeat-scan requirement is enforced at the query level, not in documentation.&lt;/strong&gt;&lt;/p&gt;
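&lt;p&gt;A runnable toy of that gate, using nothing beyond the standard library (the table shape is illustrative):&lt;/p&gt;

```python
import sqlite3

# The repeat-scan gate as a toy: only files recorded twice or more
# survive the GROUP BY ... HAVING filter quoted above.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE history (file_path TEXT, deficit_score REAL)")
db.executemany(
    "INSERT INTO history VALUES (?, ?)",
    [("a.py", 0.6), ("a.py", 0.4), ("b.py", 0.5)],  # a.py scanned twice
)
repeat_files = [
    row[0]
    for row in db.execute(
        "SELECT file_path FROM history GROUP BY file_path HAVING COUNT(*) >= 2"
    )
]
print(repeat_files)  # ['a.py']; b.py yields no calibration event
```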




&lt;h2&gt;
  
  
  Claim 3: "Updates only when the signal is strong enough"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54 (constants) and 251–262 (enforcement)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;   &lt;span class="c1"&gt;# min gap between #1 and #2 candidate
&lt;/span&gt;&lt;span class="n"&gt;MIN_IMPROVEMENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;       &lt;span class="c1"&gt;# improvement events required
&lt;/span&gt;&lt;span class="n"&gt;MIN_FP_CANDIDATES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;      &lt;span class="c1"&gt;# fp_candidate events required
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 1 — confidence gap check (line 251):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;insufficient_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Confidence gap &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence_gap&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &amp;lt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CONFIDENCE_GAP&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Candidates are too close — need more history data for reliable calibration.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;  &lt;span class="c1"&gt;# NO UPDATE APPLIED
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Gate 2 — score delta check (line 262):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;winner_score&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no_change&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# also does not apply
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two independent guards. Both must pass before any weight update applies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Ambiguous signal is rejected twice before touching configuration.&lt;/strong&gt;&lt;/p&gt;
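&lt;p&gt;The two gates condense into a dozen lines. This sketch copies the constants above but simplifies the result object to a plain status string:&lt;/p&gt;

```python
CONFIDENCE_GAP = 0.10   # min gap between #1 and #2 candidate
MIN_SCORE_DELTA = 0.02  # min improvement over the current weights

def calibration_status(confidence_gap, current_score, winner_score):
    if confidence_gap < CONFIDENCE_GAP:
        return "insufficient_data"  # Gate 1: candidates too close
    if current_score - winner_score < MIN_SCORE_DELTA:
        return "no_change"          # Gate 2: improvement too small
    return "ok"                     # both gates passed; update applies

print(calibration_status(0.05, 0.50, 0.40))  # insufficient_data
print(calibration_status(0.20, 0.50, 0.49))  # no_change
print(calibration_status(0.20, 0.50, 0.40))  # ok
```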




&lt;h2&gt;
  
  
  Claim 4: "Leaves behind a visible policy every time it changes"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, docstring line 17–18&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Return CalibrationResult&lt;span class="p"&gt;;&lt;/span&gt; optionally write to .slopconfig.yaml via &lt;span class="nt"&gt;--apply-calibration&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;--apply-calibration&lt;/code&gt; is passed and &lt;code&gt;status == "ok"&lt;/code&gt;, optimal weights are written to &lt;code&gt;.slopconfig.yaml&lt;/code&gt;. Plain-text YAML. Human-readable. Git-versionable. Every calibration change is a diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. The policy artifact is explicit. You can &lt;code&gt;git blame&lt;/code&gt; it.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Claim 5: "Explicit limits govern calibration"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: &lt;code&gt;src/slop_detector/ml/self_calibrator.py&lt;/code&gt;, lines 37–54&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;MIN_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;             &lt;span class="c1"&gt;# minimum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_W&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;             &lt;span class="c1"&gt;# maximum allowed weight per dimension
&lt;/span&gt;&lt;span class="n"&gt;MAX_PURITY_WEIGHT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# purity ceiling
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_TOLERANCE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.15&lt;/span&gt;  &lt;span class="c1"&gt;# max per-dimension deviation from domain anchor
&lt;/span&gt;&lt;span class="n"&gt;DOMAIN_DRIFT_LIMIT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt; &lt;span class="c1"&gt;# warn when optimal weight drifts this far
&lt;/span&gt;&lt;span class="n"&gt;GRID_STEP&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;             &lt;span class="c1"&gt;# 0.05 increment resolution
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No ML model. No learned bounds. Every constraint is a named constant with a comment explaining why it exists. The calibration space is a bounded grid, not an open optimization landscape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Every limit is auditable. Nothing is opaque.&lt;/strong&gt;&lt;/p&gt;
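&lt;p&gt;As an illustration of how such bounds behave, here is a hypothetical clamp pass over candidate weights. The function name and the re-normalization step are mine for the sketch; the real calibrator searches a bounded grid instead:&lt;/p&gt;

```python
MIN_W, MAX_W = 0.10, 0.65
MAX_PURITY_WEIGHT = 0.25

def clamp_weights(weights):
    # Clamp each dimension into its allowed band; purity gets a lower ceiling.
    clamped = {}
    for dim, w in weights.items():
        ceiling = MAX_PURITY_WEIGHT if dim == "purity" else MAX_W
        clamped[dim] = min(max(w, MIN_W), ceiling)
    # Re-normalize so the weights still sum to 1.0 (a simplification).
    total = sum(clamped.values())
    return {dim: w / total for dim, w in clamped.items()}

w = clamp_weights({"ldr": 0.80, "inflation": 0.05, "ddc": 0.10, "purity": 0.40})
print(round(sum(w.values()), 6))  # 1.0
```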




&lt;h2&gt;
  
  
  Claim 6: "Detects empty implementations, phantom dependencies, disconnected pipelines"
&lt;/h2&gt;

&lt;p&gt;These are the canonical defect patterns AI code generation produces at scale: the three named in the claim, plus the closely related case of clone clusters. Each has a dedicated module.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Defect class&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Empty/stub functions&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ldr.py&lt;/code&gt; — LDRCalculator detects &lt;code&gt;pass&lt;/code&gt;, &lt;code&gt;...&lt;/code&gt;, &lt;code&gt;raise NotImplementedError&lt;/code&gt;, &lt;code&gt;TODO&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phantom/unused imports&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/hallucination_deps.py&lt;/code&gt; — AST-based import vs usage analysis via &lt;code&gt;HallucinatedDependency&lt;/code&gt; dataclass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Disconnected pipelines&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/metrics/ddc.py&lt;/code&gt; — DDC (Declared Dependency Completeness) usage ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Function clone clusters&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;src/slop_detector/patterns/python_advanced.py&lt;/code&gt; — Jensen-Shannon Divergence on 30-dim AST histograms, JSD &amp;lt; 0.05 = clone&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The clone detection is worth noting. JSD on AST histograms catches structural duplication that string similarity misses entirely. LLMs produce a lot of this — same function logic, slightly renamed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Each defect class has a named module with a working implementation.&lt;/strong&gt;&lt;/p&gt;
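&lt;p&gt;To show why the JSD approach works, here is a self-contained sketch. Counting AST node types stands in for the real 30-dimension histogram, so treat the numbers as illustrative:&lt;/p&gt;

```python
import ast
import math
from collections import Counter

def ast_histogram(source):
    # Count AST node types; a rough stand-in for the 30-dim histogram.
    return Counter(type(node).__name__ for node in ast.walk(ast.parse(source)))

def jsd(p, q):
    # Jensen-Shannon divergence between two normalized histograms.
    keys = set(p) | set(q)
    ps, qs = sum(p.values()), sum(q.values())
    P = {k: p.get(k, 0) / ps for k in keys}
    Q = {k: q.get(k, 0) / qs for k in keys}
    M = {k: (P[k] + Q[k]) / 2 for k in keys}
    kl = lambda A: sum(A[k] * math.log2(A[k] / M[k]) for k in keys if A[k])
    return (kl(P) + kl(Q)) / 2

# Same logic, every identifier renamed: string similarity is fooled,
# the AST histograms are identical.
clone_a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
clone_b = "def acc(items):\n    r = 0\n    for i in items:\n        r += i\n    return r"
print(jsd(ast_histogram(clone_a), ast_histogram(clone_b)) < 0.05)  # True: flagged as clone
```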




&lt;h2&gt;
  
  
  Claim 7: "~1.4K downloads in the past week"
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Source&lt;/strong&gt;: pypistats.org API (&lt;code&gt;mirrors=false&lt;/code&gt;), queried 2026-04-15&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;last_week&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;1,407  (mirrors excluded — actual pip install traffic)&lt;/span&gt;
&lt;span class="na"&gt;last_month&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;1,787&lt;/span&gt;
&lt;span class="na"&gt;last_day&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;   &lt;span class="m"&gt;83&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"~1.4K" is within 0.5% of 1,407. Mirrors excluded means bot traffic is stripped — these are real install invocations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;VERDICT: TRUE. Verified against pypistats in real time. The number is not rounded up.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this format exists
&lt;/h2&gt;

&lt;p&gt;Most open-source project posts make claims. Few back them up with file paths and line numbers.&lt;/p&gt;

&lt;p&gt;That gap is the same problem AI-SLOP Detector is built to close. AI-generated code makes claims too — functions that look complete, imports that look used, pipelines that look connected. Static analysis finds the gap between what the code says and what it does.&lt;/p&gt;

&lt;p&gt;This post applies the same standard to the project's own marketing copy. If a claim can be verified, it should be. If it can't, it shouldn't be made.&lt;/p&gt;

&lt;p&gt;The codebase is public: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;github.com/flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pull requests welcome. Audits welcome more.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Verified by static code analysis + pypistats API, 2026-04-15&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aitools</category>
      <category>opensource</category>
      <category>codequality</category>
      <category>python</category>
    </item>
    <item>
      <title>It Gets Smarter Every Scan: AI-SLOP Detector v3.5.0 and the Self-Calibration Loop</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 13 Apr 2026 16:15:49 +0000</pubDate>
      <link>https://dev.to/flamehaven01/it-gets-smarter-every-scan-ai-slop-detector-v350-and-the-self-calibration-loop-3fia</link>
      <guid>https://dev.to/flamehaven01/it-gets-smarter-every-scan-ai-slop-detector-v350-and-the-self-calibration-loop-3fia</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ombaq79ho65mgbtjqyg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ombaq79ho65mgbtjqyg.png" alt="cover" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; 🔻&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n"&gt;v3.1.0 — Three Formula Refinements and the Adversarial Tester That Found Them&lt;/a&gt; · &lt;br&gt;
🔻&lt;a href="https://dev.to/flamehaven01/the-tool-that-turned-on-itself-ai-slop-detector-v290-v291-3oc4"&gt;v2.9.0/v2.9.1 — The Tool That Turned On Itself&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By late 2025, everyone was building with AI. A weekend was enough to launch a SaaS app, and by Monday it was already on Product Hunt. The code looked finished, the UI worked, and the demo landed. That was also the problem.&lt;/p&gt;

&lt;p&gt;In 2026, some of the consequences started arriving in public. Exposed databases, weak security boundaries, brittle automation, and production systems that looked polished enough to ship but had clearly not been understood at the level their surface confidence implied. Not every one of those failures belongs to static analysis, and it would be too easy to pretend otherwise. But many of them still point to the same upstream condition: code that looks complete long before it deserves trust.&lt;/p&gt;

&lt;p&gt;That is the layer this release is about.&lt;/p&gt;




&lt;h2&gt;
  
  
  The breach is the headline. The review gap is the story.
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ec93in1avitflyntlvp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ec93in1avitflyntlvp.png" alt="Structurally plausible, functionally thin" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A missing security rule is not the same thing as a stubbed auth function. A runtime-only bug is not the same thing as a phantom import. A broken architecture is not the same thing as a buzzword-heavy helper. These are different failure classes, and any serious tool has to respect that difference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag337gfzcfw1zdjiancq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fag337gfzcfw1zdjiancq.png" alt="output scales while oversight stagnates" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What they often share, though, is the review environment that let them through. AI increased output volume, speed, and surface polish. Review depth did not increase with it. That matters because AI-generated code has a very recognizable habit: it often looks complete before it is complete.&lt;/p&gt;

&lt;p&gt;It compiles. It passes tests. It sounds like it knows what it is doing. Then you open the function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_quality_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Advanced multi-dimensional quality assessment using
    proprietary algorithms with statistical normalization,
    entropy-based weighting, and dynamic threshold calibration.
    Returns a score between 0 and 100.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# TODO: implement the actual algorithm
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;85.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not noisy code. It is confident emptiness. In an analytics path, it becomes false certainty. In a payment path, it becomes a defect. In an auth path, it becomes risk. &lt;/p&gt;

&lt;p&gt;The issue is not that AI writes ugly code. The issue is that AI reliably produces code that is structurally plausible while functionally thin.&lt;/p&gt;

&lt;p&gt;That is a narrower claim than “AI is dangerous,” but it is also far more useful.&lt;/p&gt;




&lt;h2&gt;
  
  
  We ran into this ourselves
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc91ne3b4usfndqmn5p0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frc91ne3b4usfndqmn5p0.png" alt="4-dimensional weighted geometric mean" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This did not begin as a theory about other people’s repos. It began when we found a flaw in our own scoring model. Back in v2.8.0, we discovered that our formula was accidentally rewarding spaghetti code: a large god function could sometimes look healthier than a small clean function, because complexity was dividing the penalty instead of amplifying it.&lt;/p&gt;

&lt;p&gt;That was backwards, so the math changed.&lt;/p&gt;

&lt;p&gt;AI-SLOP Detector now evaluates four dimensions: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LDR&lt;/strong&gt; for logic density
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inflation&lt;/strong&gt; for jargon density relative to real logic
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DDC&lt;/strong&gt; for dependency usage rather than dependency presence
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Purity&lt;/strong&gt; for critical structural defects that should drag the whole score down
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are combined with a &lt;strong&gt;weighted geometric mean&lt;/strong&gt;, not an arithmetic average.&lt;/p&gt;

&lt;p&gt;Why that matters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;one strong-looking axis should not be able to hide a collapsed one
&lt;/li&gt;
&lt;li&gt;a polished docstring should not rescue empty logic
&lt;/li&gt;
&lt;li&gt;if one important dimension fails, the whole score should feel it&lt;/li&gt;
&lt;/ul&gt;
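&lt;p&gt;The difference is easy to see in numbers. A toy comparison, with illustrative weights rather than the shipped defaults:&lt;/p&gt;

```python
import math

WEIGHTS = {"ldr": 0.40, "inflation": 0.20, "ddc": 0.25, "purity": 0.15}

def arithmetic(scores):
    return sum(scores[d] * w for d, w in WEIGHTS.items())

def geometric(scores):
    # Weighted geometric mean: a collapsed axis drags the product down.
    return math.prod(scores[d] ** w for d, w in WEIGHTS.items())

# Logic density has collapsed; everything else looks great.
scores = {"ldr": 0.05, "inflation": 0.95, "ddc": 0.95, "purity": 0.95}
print(round(arithmetic(scores), 2))  # 0.59, the failure is averaged away
print(round(geometric(scores), 2))   # 0.29, the failure is felt
```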

&lt;p&gt;That is the scoring philosophy underneath the tool. But even that was not enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  Static analyzers have a threshold problem
&lt;/h2&gt;

&lt;p&gt;Take a perfectly legitimate ML helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;prepare_training_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;raw_samples&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PreTrainedTokenizer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Tokenize and pad samples for transformer training.
    Handles attention mask generation and HuggingFace
    tokenizer conventions for batch encoding.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;raw_samples&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;padding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;truncation&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is nothing wrong with this code. But a generic detector may still overreact, because terms like &lt;code&gt;tokenizer&lt;/code&gt;, &lt;code&gt;attention mask&lt;/code&gt;, and &lt;code&gt;HuggingFace&lt;/code&gt; can look suspicious if the analyzer does not understand the domain it is scanning. In a real ML codebase, those terms are normal. In a CRUD backend, some of them may be genuine anomaly signals.&lt;/p&gt;

&lt;p&gt;That is the threshold problem. The same threshold can be wrong in one codebase and exactly right in another. A universal threshold sounds elegant, but real repositories are local. They have habits, idioms, and boilerplate that are legitimate inside one domain and suspicious inside another.&lt;/p&gt;

&lt;p&gt;So the next problem became obvious: the tool had to learn the project it was scanning. That is the real center of v3.5.0.&lt;/p&gt;




&lt;h2&gt;
  
  
  What AI-SLOP Detector actually does
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1e70mgohgkb3xwldop.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9s1e70mgohgkb3xwldop.png" alt="scanning for structrural integrity" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP Detector&lt;/a&gt; is a static analyzer built to catch a specific defect class that shows up repeatedly in AI-generated code: unimplemented stubs, disconnected pipelines, phantom imports, clone-shaped emptiness, placeholder-heavy production paths, and jargon inflation that outruns the actual logic. It is not a style linter, not a full security scanner, and not a runtime verifier. It is a detector for &lt;strong&gt;structural hollowness&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That distinction matters because it keeps the claim honest. The tool is not trying to solve every production risk. It is trying to catch one layer that becomes more expensive as AI output scales faster than human review.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector
slop-detector &lt;span class="nt"&gt;--init&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt;&lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The workflow is the product story
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ilk1rr0i2hozydca4o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2ilk1rr0i2hozydca4o.png" alt="why universal rules fail real repositories" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What makes this release interesting is that it is not just “more patterns” or “more language support.” It is a workflow story.&lt;/p&gt;

&lt;p&gt;The detector now has a real loop. It scans the file, classifies its role, computes the 4D score, applies structural pattern penalties, and writes the result to history. Then, once enough repeated scans exist, it revisits that history, extracts behavioral signals, tunes the weights inside bounded domain-aware limits, updates the configuration, and keeps scanning. That is the release.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fq4x5gkdfp5mwe5nfcl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fq4x5gkdfp5mwe5nfcl.png" alt="mermaid1" width="800" height="1345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That final stretch is what changed this from “detector upgrade” into “adaptive detector.” The tool no longer only evaluates code. It also learns from what happens after evaluation.&lt;/p&gt;
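&lt;p&gt;As a rough sketch of that loop (every name below is illustrative, not the detector's real API), one pass looks like this:&lt;/p&gt;

```python
# Illustrative sketch of a single scan pass; the role classifier,
# weights, and penalty values are invented for illustration.
from dataclasses import dataclass

@dataclass
class ScanResult:
    path: str
    role: str          # e.g. "production", "other"
    score_4d: float    # 0.0 (hollow) .. 1.0 (substantive)

def scan_file(path: str, weights: dict) -> ScanResult:
    # 1. classify the file's role
    role = "production" if path.endswith(".py") else "other"
    # 2. compute a role-aware baseline from the current weights
    base = weights.get(role, 0.5)
    # 3. apply a structural pattern penalty (toy heuristic here)
    penalty = 0.2 if "stub" in path else 0.0
    return ScanResult(path, role, max(0.0, base - penalty))

# 4. every result is appended to history, the surface later tuning reads
history = []
weights = {"production": 0.9, "other": 0.5}
for path in ["api.py", "stub_handler.py"]:
    history.append(scan_file(path, weights))
```

The point of the sketch is the shape, not the numbers: score, record, and only later revisit the record.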




&lt;h2&gt;
  
  
  Self-calibration is the real headline
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql1dkuxvotgmm6bnzc4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fql1dkuxvotgmm6bnzc4x.png" alt="Mechanical self-calibration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every scan is recorded to a local SQLite history database. That history is not just there for reporting. It becomes the signal surface for the next tuning step. Once enough repeated scans accumulate, the detector begins asking a simple question: when this file was flagged, what happened next?&lt;/p&gt;

&lt;p&gt;That produces two behavior-derived event types. An &lt;strong&gt;improvement event&lt;/strong&gt; means the file was flagged, later changed, and its deficit dropped meaningfully. A &lt;strong&gt;false-positive candidate&lt;/strong&gt; means the file was flagged, then scanned again with the same content and little meaningful score movement.&lt;/p&gt;

&lt;p&gt;That difference is more important than it sounds. A lot of “self-improving” systems quietly learn from their own outputs. They mark something suspicious, then later use that same judgment as the truth signal for tuning. The system becomes better at agreeing with itself. That is not calibration. That is self-imitation with cleaner packaging.&lt;/p&gt;

&lt;p&gt;v3.5.0 tries to avoid that trap. Its labels are not taken from the scoring formula. They are inferred from developer behavior around repeated scans. The formula says, “this looks suspicious.” The next run reveals whether a human treated that suspicion as real.&lt;/p&gt;

&lt;p&gt;That signal is not perfect. An unchanged file is not always a false positive. It may be legacy code, low priority, or simply out of scope. But it is still a healthier signal than teaching the formula to imitate its own prior outputs.&lt;/p&gt;
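&lt;p&gt;The two event types can be sketched as a comparison between consecutive scans of the same file. The thresholds and field names here are assumptions for illustration, not the shipped defaults:&lt;/p&gt;

```python
# Sketch of behavior-derived event extraction; thresholds and field
# names are illustrative assumptions.
def classify_event(prev: dict, curr: dict, drop_threshold: float = 0.15) -> str:
    """Compare two scans of the same file and label what happened."""
    if prev["content_hash"] != curr["content_hash"]:
        # file was changed after being flagged
        if prev["deficit"] - curr["deficit"] >= drop_threshold:
            return "improvement"  # deficit dropped meaningfully
        return "inconclusive"
    # file was rescanned with identical content
    if abs(prev["deficit"] - curr["deficit"]) < 0.05:
        return "false_positive_candidate"  # flagged, untouched, score static
    return "inconclusive"  # e.g. weights changed between runs
```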




&lt;h2&gt;
  
  
  What the loop actually looks like
&lt;/h2&gt;

&lt;p&gt;The loop is not mystical. It is mechanical. Repeated scans accumulate, improvement and likely-FP events are extracted, candidate weight sets are evaluated, the search is bounded around the project’s current domain anchor, and if a strong enough winner appears, the config gets updated. If a calibrated weight drifts too far from the domain anchor, the system emits a warning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfpv792zuufs9nm0czgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfpv792zuufs9nm0czgz.png" alt="mermaid2" width="800" height="2640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what makes the title true. It gets smarter every scan, not because a hidden model is hallucinating taste, but because repeated use creates a bounded feedback loop. That is much less magical, and much more trustworthy.&lt;/p&gt;
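&lt;p&gt;The bounded step is the part worth pinning down. A minimal sketch of anchor-bounded calibration, with an invented band width and warning, might look like this:&lt;/p&gt;

```python
# Sketch of anchor-bounded weight calibration; the band width and
# clamp-plus-warn behavior are illustrative, not the real implementation.
import warnings

def calibrate(anchor: float, proposed: float, max_drift: float = 0.25) -> float:
    """Keep a candidate weight inside a band around the domain anchor."""
    lo, hi = anchor - max_drift, anchor + max_drift
    clamped = min(max(proposed, lo), hi)
    if clamped != proposed:
        # drift past the anchor band is surfaced, not silently accepted
        warnings.warn(f"weight {proposed} outside anchor band; clamped to {clamped:.2f}")
    return clamped
```

The design choice matters more than the numbers: the search can move, but it cannot wander arbitrarily far from what the domain anchor says is plausible.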




&lt;h2&gt;
  
  
  Why &lt;code&gt;--init&lt;/code&gt; matters more now
&lt;/h2&gt;

&lt;p&gt;There is another reason the calibration story works better in v3.5.0. The detector no longer starts from a generic nowhere. &lt;code&gt;--init&lt;/code&gt; now performs domain-aware bootstrap, detects the likely project type, and seeds the starting weights accordingly. That means calibration starts near the right neighborhood instead of wandering across the whole map.&lt;/p&gt;

&lt;p&gt;That improves the first week of use, not just the tenth. And that matters, because bad first impressions kill adaptive tools. If the detector is only smart after a month of annoying you, it will never survive long enough to get smart.&lt;/p&gt;

&lt;p&gt;Good initialization is not a convenience feature. It is part of whether the loop can gather clean signal at all.&lt;/p&gt;
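&lt;p&gt;A minimal sketch of what domain-aware bootstrap means in practice. The marker files and seed weights below are invented for illustration; the real detector's heuristics and defaults will differ:&lt;/p&gt;

```python
# Sketch of domain-aware seeding; marker files and weights are
# hypothetical, not the shipped configuration.
DOMAIN_SEEDS = {
    "web_api": {"completeness": 0.35, "connectivity": 0.30},
    "ml":      {"completeness": 0.25, "connectivity": 0.20},
    "generic": {"completeness": 0.30, "connectivity": 0.25},
}

def detect_domain(filenames: set) -> str:
    """Guess the project type from marker files at the repo root."""
    if {"requirements.txt", "model.py"} <= filenames:
        return "ml"
    if "app.py" in filenames or "main.go" in filenames:
        return "web_api"
    return "generic"

def seed_weights(filenames: set) -> dict:
    """Start calibration near the right neighborhood, not from nowhere."""
    return dict(DOMAIN_SEEDS[detect_domain(filenames)])
```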




&lt;h2&gt;
  
  
  JS, TS, and Go are not side quests
&lt;/h2&gt;

&lt;p&gt;v3.5.0 also expands analysis coverage to Go, JS, JSX, TS, and TSX. That is useful on its own, but the deeper significance is architectural. Structurally hollow AI-generated code is not a Python-only phenomenon. If the detector’s long-term direction is project-local calibration rather than one-size-fits-all scoring, then wider language support is not a side feature. It is the natural expansion of the same idea.&lt;/p&gt;

&lt;p&gt;Different languages. Same review gap. Same loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The honest boundary
&lt;/h2&gt;

&lt;p&gt;This tool still does &lt;strong&gt;not&lt;/strong&gt; close every gap. It will not fix missing infrastructure controls, catch every runtime bug, prove the architecture is correct, or replace security review. A clean structural profile is not proof of safety.&lt;/p&gt;

&lt;p&gt;What it can do is narrow one expensive blind spot: the distance between code that &lt;strong&gt;looks finished&lt;/strong&gt; and code that carries enough actual logic to deserve confidence. That is a smaller claim than “AI risk solved,” but it is also the kind of claim that survives production better.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why this matters now
&lt;/h2&gt;

&lt;p&gt;AI has made software generation dramatically cheaper. It has not made understanding cheaper. That difference is where governance debt begins to accumulate.&lt;/p&gt;

&lt;p&gt;If teams can now generate far more code than they can truly review, then the review stack needs tools that operate below style and above syntax. Not tools that ask whether the code is pretty, but tools that ask whether the implementation carries enough substance for the confidence wrapped around it.&lt;/p&gt;

&lt;p&gt;That is the space AI-SLOP Detector is trying to occupy. Not the whole problem. Just one layer that became impossible to ignore.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector
&lt;span class="nb"&gt;cd &lt;/span&gt;my-project/
slop-detector &lt;span class="nt"&gt;--init&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fix what clearly deserves fixing. Leave legitimate idioms alone. Then keep scanning. If the loop is doing its job, the next pass should know your codebase a little better than the first one did.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;GitHub: flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;

</description>
      <category>devtool</category>
      <category>architecture</category>
      <category>ai</category>
      <category>governance</category>
    </item>
    <item>
      <title>Can AI Review Physics? Yes — That Is Why We Built SPAR</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Sun, 12 Apr 2026 09:34:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/can-ai-review-physics-yes-that-is-why-we-built-spar-1ojk</link>
      <guid>https://dev.to/flamehaven01/can-ai-review-physics-yes-that-is-why-we-built-spar-1ojk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;A standalone review framework for checking whether outputs deserve the claims attached to them.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Most review systems answer a familiar question:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Did the system still produce the expected output?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;SPAR is built for a narrower and more dangerous one:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Does the output still deserve the claim attached to it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is the split. Not reliability alone, but &lt;strong&gt;admissibility&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In practical terms, admissibility means &lt;strong&gt;claim-worthiness&lt;/strong&gt;: whether a result justifies the interpretation, governance status, or scientific statement built on top of it.&lt;/p&gt;

&lt;p&gt;A system can be reliable and inadmissible at the same time.&lt;/p&gt;

&lt;p&gt;A physics engine can compute &lt;code&gt;beta_G_norm&lt;/code&gt;, return zero, pass regression, and stay green across the whole pipeline. The report can still say the background is admissible. But if the function producing &lt;code&gt;beta_G_norm&lt;/code&gt; is a stub that always returns zero, the output is stable while the claim attached to it is false.&lt;/p&gt;

&lt;p&gt;That is not hypothetical. It is one concrete class of review failure SPAR was designed to surface.&lt;/p&gt;
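&lt;p&gt;A stripped-down illustration of that failure class (the function body here is deliberately a stub; it is not SPAR or Flamehaven-TOE code):&lt;/p&gt;

```python
# Minimal illustration of "reliable but inadmissible": a stub that
# always returns zero passes every output check.
def beta_G_norm(state):
    return 0.0  # stub: identical output regardless of the background

# Ordinary regression: stable and green on every run.
assert beta_G_norm({"background": "flat"}) == 0.0
assert beta_G_norm({"background": "curved"}) == 0.0  # should NOT vanish here

# The output is perfectly stable. The claim "this background is
# admissible" built on the second call is still false, because the
# producing path is a stub, not a computation.
```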




&lt;h2&gt;
  
  
  What SPAR Is
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F140n2kz9fwvslapkmc6m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F140n2kz9fwvslapkmc6m.png" alt="What SPAR Is" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPAR (Sovereign Physics Autonomous Review)&lt;/strong&gt; is a deterministic framework for &lt;strong&gt;claim-aware review&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It does not replace unit tests. It does not replace regression benchmarks. It does not replace scoring systems. It reviews a different object:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the output&lt;/li&gt;
&lt;li&gt;the claim attached to that output&lt;/li&gt;
&lt;li&gt;the implementation state behind it&lt;/li&gt;
&lt;li&gt;the maturity state that should travel with it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SPAR started inside Flamehaven-TOE, an open physics simulation and AI governance engine. It has since been extracted into a &lt;strong&gt;standalone open-source framework&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/flamehaven01/SPAR-Framework" rel="noopener noreferrer"&gt;github.com/flamehaven01/SPAR-Framework&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The framework includes a generic review kernel, explicit score and verdict policy, registry-backed review surfaces, and a first domain adapter for physics. Physics is the first adapter and the first domain where this review model was stress-tested. It is not the limit of the framework.&lt;/p&gt;

&lt;p&gt;The core idea is simpler than the name: &lt;strong&gt;an output can pass while the claim attached to it drifts.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Ordinary Review Is Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke30jyn0seb4vvlwpp0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fke30jyn0seb4vvlwpp0i.png" alt="Reliability ≠ Truth" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ordinary review is usually shallow by necessity. It asks questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;did the code run&lt;/li&gt;
&lt;li&gt;did the output shape stay valid&lt;/li&gt;
&lt;li&gt;did the score remain within bounds&lt;/li&gt;
&lt;li&gt;did regression remain green&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are necessary checks. They are not always enough.&lt;/p&gt;

&lt;p&gt;The failure SPAR cares about is not, in the first instance, a crash. It is not even always a wrong number. It is this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the code executes&lt;/li&gt;
&lt;li&gt;the output looks plausible&lt;/li&gt;
&lt;li&gt;the tests pass&lt;/li&gt;
&lt;li&gt;and the interpretation is still overstated or structurally false&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That failure can appear in several ways: a placeholder implementation returns stable-looking values; a maturity registry stays stale after the implementation improves; a score looks smooth before its epistemic basis is strong enough to justify the interpretation attached to it; an approximation gets reported as closure.&lt;/p&gt;

&lt;p&gt;None of these failures is spectacular. That is exactly why they are easy to miss.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Minimal Divergence
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe774mdtlkafs2zktlob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwe774mdtlkafs2zktlob.png" alt="Ordinary Review vs SPAR" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The clearest way to see the difference is in review form.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ordinary_regression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"kernel_exec"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"output_contract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"PASS"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"test_suite_status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GREEN"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spar_review"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_a"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONSISTENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_b"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CONSISTENT"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"layer_c"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"GAP_STATE_MISMATCH"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finding"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Implementation path is genuine; registry classification remains stale"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required_action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"RECLASSIFY"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ordinary regression says the system still works. SPAR says the system may no longer be describing its own computation truthfully.&lt;/p&gt;

&lt;p&gt;SPAR is not "tests, but harsher." It can produce a &lt;strong&gt;different review outcome&lt;/strong&gt; even when ordinary regression remains green. In this case, the required action is not rejection. It is reclassification. That is not the same as testing harder. It is reviewing a different object.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Mismatch Classes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bx0i61pzmsgs7ad123q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6bx0i61pzmsgs7ad123q.png" alt="What We Catch" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPAR treats three mismatch classes as first-class review objects.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anchor mismatch&lt;/strong&gt; — the output conflicts with a declared analytical or contractual anchor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interpretation mismatch&lt;/strong&gt; — the report language claims more than the implementation state justifies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maturity mismatch&lt;/strong&gt; — the implementation, registry, and outward-facing claim have drifted out of sync.&lt;/p&gt;

&lt;p&gt;Ordinary review mostly checks whether a system still passes. SPAR checks whether the result is still being described honestly.&lt;/p&gt;
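&lt;p&gt;The three classes can be sketched as explicit checks over the states a result travels with. Everything here is an illustrative simplification, not SPAR's actual data model:&lt;/p&gt;

```python
# Sketch mapping the three mismatch classes to checks; the state
# vocabulary and mapping are invented for illustration.
def expected_maturity(impl_state: str) -> str:
    """What the registry should say, given the implementation state."""
    return {"genuine": "closed", "approximate": "partial"}.get(impl_state, "open")

def find_mismatches(anchor_ok: bool, claim: str,
                    impl_state: str, registry_state: str) -> list:
    found = []
    if not anchor_ok:
        found.append("anchor_mismatch")          # output vs declared anchor
    if claim == "closed" and impl_state != "genuine":
        found.append("interpretation_mismatch")  # language outruns implementation
    if expected_maturity(impl_state) != registry_state:
        found.append("maturity_mismatch")        # registry drifted out of sync
    return found
```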




&lt;h2&gt;
  
  
  The Three-Layer Structure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6awaygug9q9gpiivqu4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6awaygug9q9gpiivqu4.png" alt="Deterministic, Not LLM-Judged" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer A — Anchor Consistency
&lt;/h3&gt;

&lt;p&gt;Layer A checks whether output agrees with a declared analytical or contractual anchor. The expected value is not "whatever the engine produced last time." It is "what the declared contract says must appear for this background, under this formulation."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Flat Minkowski: beta residuals must vanish
&lt;/span&gt;&lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;CONSISTENT&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;beta_G&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mf"&gt;1e-4&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;ANOMALY&lt;/span&gt;

&lt;span class="c1"&gt;# de Sitter: admissibility gate must FAIL
# A PASS here indicates a gate defect, not a success.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Layer A tests agreement with a declared reference contract — not truth in some unconstrained universal sense. Analytical anchors depend on regime, normalization, and formulation. That distinction matters. Still, the engineering value is clear: reliability can remain intact while anchor-consistency fails. A Layer A anomaly means the output contradicts the contract the system claims to be using.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer B — Interpretation Validity
&lt;/h3&gt;

&lt;p&gt;Layer B checks whether the interpretation attached to the output stays within declared scope. This layer is &lt;strong&gt;deterministic&lt;/strong&gt; — it does not rely on a free-form LLM judge. It uses explicit rule tables over structured runtime artifacts, maturity states, and report text.&lt;/p&gt;

&lt;p&gt;Typical checks: does the report claim full closure while the path is still heuristic or partial; is a bounded approximation being described as exact; is an environment-conditional bridge being written up as universal; are overclaim phrases appearing where runtime state does not support them.&lt;/p&gt;

&lt;p&gt;Layer B does not eliminate semantic ambiguity. What it does is narrow the problem from "solve rhetoric in general" to "enforce explicit admissibility contracts against declared model states." That makes it auditable. Not complete. Auditable.&lt;/p&gt;
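&lt;p&gt;One such rule can be sketched as a table of overclaim phrases keyed to the maturity states that would license them. The phrases and states below are illustrative, not the framework's real rule set:&lt;/p&gt;

```python
# Sketch of a deterministic Layer B rule: certain phrases are only
# admissible at certain maturity states. Rule table is illustrative.
OVERCLAIM_RULES = {
    "full closure": {"closed"},
    "exact":        {"closed"},
    "universal":    {"closed"},
}

def check_interpretation(report_text: str, maturity: str) -> list:
    """Return the phrases whose required maturity state is not met."""
    text = report_text.lower()
    return [phrase for phrase, allowed in OVERCLAIM_RULES.items()
            if phrase in text and maturity not in allowed]
```

No model judges tone here; the same report and the same state always produce the same violations, which is what makes the check auditable.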

&lt;h3&gt;
  
  
  Layer C — Existence and Maturity Probes
&lt;/h3&gt;

&lt;p&gt;Layer C asks what kind of implementation produced the result: genuine, approximate, gapped, environment-conditional, or research-only.&lt;/p&gt;

&lt;p&gt;This is where SPAR becomes especially different from ordinary review. It does not merely score outputs. It checks the &lt;strong&gt;ontological status&lt;/strong&gt; of the path that produced them. A result from a known-limited path is not the same thing as a result from a genuine path. A research probe is not production-grade closure. A dependency-bound bridge is not a universal capability. Those distinctions change what the output is allowed to claim.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Registry Matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtrpjd5figmbutwl658a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdtrpjd5figmbutwl658a.png" alt="Machine-Readable Governance" width="800" height="447"&gt;&lt;/a&gt;&lt;br&gt;
A framework like this needs more than score outputs. It needs structured state that can travel with the result.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ARCHITECTURE_GAPS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C4_sidrce_omega&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIDRCE Omega (v4.5+): chi-squared Gaussian model. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score = exp(-chi2/2), chi2 = (||beta||/tol)^2. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Derived from zero-centered Gaussian likelihood. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GAP: tolerance values are calibration parameters with &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;no first-principles derivation of their exact magnitude.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;C8_rg_linearized&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RG flow: linearized dilaton ODE only. Metric does NOT flow. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROXIMATION: valid only for small perturbations &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;around a fixed background geometry.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A machine-readable registry turns caveat prose into runtime surface. That lets review results carry explicit maturity labels — &lt;code&gt;open&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;, &lt;code&gt;environment_conditional&lt;/code&gt;, &lt;code&gt;research_only&lt;/code&gt; — rather than vague prose. Without that surface, approximation and closure collapse into the same sentence.&lt;/p&gt;




&lt;h2&gt;
  
  
  Scoring Policy Is Explicit
&lt;/h2&gt;

&lt;p&gt;SPAR keeps score policy visible. No hidden learned weights.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;SCORE_TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ANOMALY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;       &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# contradicts an analytical anchor
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;FAIL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# review-layer failure
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GAP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# honest gap disclosure
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# bounded concern
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;APPROXIMATION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# known simplification
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;journal_verdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer_a&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;count_anomalies&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;layer_a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ACCEPT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MINOR REVISION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;MAJOR REVISION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REJECT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These are not laws of nature. They are review policy. A hidden learned scorer may feel sophisticated. An explicit policy is easier to inspect, debate, and change.&lt;/p&gt;

&lt;p&gt;Two or more Layer A anomalies trigger unconditional REJECT regardless of total score. Mathematical contract failures are not averaged away by cleaner signals elsewhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Concrete Example: The Omega Score Transition
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh145cyc50hcjmaxkk4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdh145cyc50hcjmaxkk4c.png" alt="The Omega Score Transition" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Flamehaven-TOE's primary governance metric, SIDRCE Omega, once relied on a large arbitrary multiplicative constant applied to the raw residual. The outputs looked stable. Nothing felt obviously broken.&lt;/p&gt;

&lt;p&gt;SPAR still flagged it. Stability was not the right question. The stronger question was whether the formula justified the interpretation being attached to it. A free scaling constant with no physical derivation is not the same thing as a physically motivated model.&lt;/p&gt;

&lt;p&gt;The formula was replaced with a chi-squared Gaussian construction:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;gs_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_G_norm&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_G&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;b_score&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_B_norm&lt;/span&gt;  &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;si_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;beta_result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;beta_Phi_norm&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;tol_Phi&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;omega_physics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;gs_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;b_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;si_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That change matters because it introduces a reversible relation to the underlying residual:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;||beta|| = tol * sqrt(-2 * ln(score))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Given a reported score, the residual is recoverable. The score is no longer just a presentation layer. It encodes a falsifiable relation to the quantity beneath it.&lt;/p&gt;
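&lt;p&gt;The round trip is easy to check numerically. This sketch recomputes one Gaussian component and inverts it; the tolerance value is illustrative, not the shipped calibration:&lt;/p&gt;

```python
import math

def omega_component(beta_norm: float, tol: float) -> float:
    # Forward: chi-squared Gaussian score from a normalized residual
    return math.exp(-0.5 * (beta_norm / tol) ** 2)

def residual_from_score(score: float, tol: float) -> float:
    # Inverse: recover the residual magnitude from a reported score
    return tol * math.sqrt(-2.0 * math.log(score))

tol_G = 0.05   # illustrative tolerance scale
beta = 0.03
score = omega_component(beta, tol_G)
recovered = residual_from_score(score, tol_G)
print(score, recovered)  # recovered matches beta to float precision
```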

&lt;p&gt;SPAR did not respond by declaring the problem solved. It updated the classification precisely: the formula ceased to be arbitrary, but the remaining gap shifted to the tolerance scales, which are still calibration parameters. That is a narrower and more honest claim than either "arbitrary" or "fully resolved."&lt;/p&gt;

&lt;p&gt;That is exactly the kind of distinction SPAR is built to review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; .[dev]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spar_framework.engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_review&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spar_domain_physics.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_review_runtime&lt;/span&gt;

&lt;span class="n"&gt;runtime&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_review_runtime&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;runtime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subject&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_G_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_B_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;beta_Phi_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sidrce_omega&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eft_m_kk_gev&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;1.0e16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ricci_norm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flat minkowski&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;gate&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;PASS&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;report_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bounded report text.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;verdict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_registry_snapshot&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;total_models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The review result carries more than pass/fail: a verdict, a score, and a maturity-aware review surface. That surface is what makes the output governable rather than just evaluable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where This Framework Fits
&lt;/h2&gt;

&lt;p&gt;Physics remains the first adapter and strongest early testbed. The review pattern is broader.&lt;/p&gt;

&lt;p&gt;It fits anywhere outputs can pass while the attached claim can drift: scientific computing pipelines, PDE and simulation workflows, scientific ML surrogates, inverse and calibration models, AI code review, model governance, regulated analytics and reporting.&lt;/p&gt;

&lt;p&gt;That does &lt;strong&gt;not&lt;/strong&gt; mean every team needs the full framework. Often the first useful step is smaller.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Lightweight Adoption Path
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkia6qy78ou1b90gohdut.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkia6qy78ou1b90gohdut.png" alt="Pragmatic Adoption Path" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 1 — Claim Check.&lt;/strong&gt; Add three explicit questions to an existing workflow: What is the output actually claiming? Does that claim match the implementation state? Is this result exact, approximate, partial, or heuristic? Most teams can do this immediately with no new tooling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Level 2 — Maturity Labels.&lt;/strong&gt; Attach state labels to results: &lt;code&gt;heuristic&lt;/code&gt;, &lt;code&gt;partial&lt;/code&gt;, &lt;code&gt;closed&lt;/code&gt;, &lt;code&gt;environment_conditional&lt;/code&gt;. A small registry. Already a meaningful step beyond ordinary review.&lt;/p&gt;
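&lt;p&gt;A Level 2 registry does not need infrastructure. The sketch below is one possible shape using the four labels above; the helper and result names are hypothetical:&lt;/p&gt;

```python
MATURITY_STATES = {"heuristic", "partial", "closed", "environment_conditional"}

registry = {}  # result name -> maturity label

def label(result_name: str, state: str) -> None:
    # Reject labels outside the agreed vocabulary
    if state not in MATURITY_STATES:
        raise ValueError(f"unknown maturity state: {state}")
    registry[result_name] = state

label("omega_score", "partial")
label("anchor_consistency", "closed")
print(registry)
```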

&lt;p&gt;&lt;strong&gt;Level 3 — Full SPAR.&lt;/strong&gt; Layer A anchor consistency, Layer B interpretation validity, Layer C existence and maturity probes, registry-backed snapshots, explicit score and verdict policy.&lt;/p&gt;

&lt;p&gt;SPAR can be used as a review habit before it is adopted as a full framework.&lt;/p&gt;




&lt;h2&gt;
  
  
  What SPAR Does Not Do
&lt;/h2&gt;

&lt;p&gt;SPAR does not provide a universal truth engine, free-form LLM judging in the core, domain contracts inside the generic kernel, or certainty about whether a scientific claim is true in all possible senses.&lt;/p&gt;

&lt;p&gt;SPAR is not a machine for declaring truth. Its narrower goal is to make &lt;strong&gt;claim drift reviewable&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Reliability Is Not Enough
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk32aoci0joww2uhbcqyz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk32aoci0joww2uhbcqyz.png" alt="Make Claim Drift Reviewable" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Reliability asks whether a system produces stable, repeatable outputs. Admissibility asks whether those outputs deserve the meanings attached to them.&lt;/p&gt;

&lt;p&gt;A stub that always returns zero can be reliable. A heuristic threshold can be reliable. A smoothly calibrated score can be reliable. None of those facts alone makes the resulting claim justified.&lt;/p&gt;

&lt;p&gt;Current AI and scientific tooling are already better at measuring reliability than admissibility. That asymmetry is understandable — reliability is easier to benchmark, easier to automate, easier to ship in CI. But admissibility is where silent approximations, overstated claims, and maturity mismatches accumulate.&lt;/p&gt;

&lt;p&gt;SPAR is one working answer to that problem. Not a universal answer. A technical one.&lt;/p&gt;

&lt;p&gt;It turns implementation state, maturity state, analytical anchoring, and scope honesty into review objects that can travel with the result.&lt;/p&gt;

&lt;p&gt;That is why the architecture may matter outside the domain that produced it.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/SPAR-Framework" rel="noopener noreferrer"&gt;github.com/flamehaven01/SPAR-Framework&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/WHAT_IS_SPAR.md" rel="noopener noreferrer"&gt;What Is SPAR&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/ADMISSIBILITY.md" rel="noopener noreferrer"&gt;Admissibility&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/PHYSICS_PROOF_CASE.md" rel="noopener noreferrer"&gt;Physics as the Proof Case&lt;/a&gt; · &lt;a href="https://github.com/flamehaven01/SPAR-Framework/blob/main/docs/USE_CASES.md" rel="noopener noreferrer"&gt;Use Cases&lt;/a&gt;&lt;/p&gt;

</description>
      <category>governance</category>
      <category>ai</category>
      <category>verification</category>
      <category>computerscience</category>
    </item>
    <item>
      <title>AI SLOP Detector v3.1: Three Formula Refinements and the Adversarial Tester That Found Them</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:29:57 +0000</pubDate>
      <link>https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n</link>
      <guid>https://dev.to/flamehaven01/ai-slop-detector-v31-three-formula-refinements-and-the-adversarial-tester-that-found-them-5e2n</guid>
      <description>&lt;p&gt;We shipped v2.9.0 with a scoring engine we trusted. We ran tests. Everything passed.&lt;/p&gt;

&lt;p&gt;Then we built a tool specifically designed to find cases where the score was &lt;em&gt;less precise than it could be&lt;/em&gt; — and it found three.&lt;/p&gt;

&lt;p&gt;This is the story of v3.1.0. And the patch that followed six hours later.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmla75gj6xobi0was3209.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmla75gj6xobi0was3209.png" alt="cover image" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Glossary — internal terminology used throughout this post
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Term&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deficit score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The final output of the scorer. 0 = structurally clean, 100 = critical. Derived as &lt;code&gt;100 × (1 - GQG)&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GQG&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Geometric Quality Gate. A weighted geometric mean of LDR, Inflation quality, DDC, and Purity. The single formula the scorer evaluates.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Logic Density Ratio. Ratio of executable logic lines to total lines. Low LDR = file is mostly stubs, blanks, or comments.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inflation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric that flags jargon-heavy docstrings unsupported by actual code complexity. A 2-line function with a 30-line docstring using 12 buzzwords scores badly.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Dead/Duplicate Code ratio. Tracks unreachable paths, copy-pasted blocks, phantom imports.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern hit rate. How many structural anti-patterns (god functions, stub returns, nested complexity) fire on the file.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cyclomatic Complexity (CC)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Count of independent code paths. A straight-line function = CC 1. Each &lt;code&gt;if&lt;/code&gt;, &lt;code&gt;for&lt;/code&gt;, &lt;code&gt;while&lt;/code&gt;, &lt;code&gt;except&lt;/code&gt; adds 1.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;fhval&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;flamehaven-validator. An external tool that interrogates the scorer from outside the codebase. Its purpose is to catch cases where internal test consistency masquerades as correctness.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SPAR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subcommand of fhval. Adversarial regression loop with three layers. Tests whether the scorer measures what it claims to measure.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSD&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jensen-Shannon Divergence. A symmetric, bounded (0–1) measure of divergence between two probability distributions. Used here to compare AST node-type histograms between functions.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AST&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Abstract Syntax Tree. The parsed structure of source code. An &lt;code&gt;if&lt;/code&gt; statement, a &lt;code&gt;return&lt;/code&gt;, a function call each become typed nodes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;function_clone_cluster&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New pattern in v3.1.0. Detects files where many functions share near-identical AST structure — the fragmented god function evasion pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;placeholder_variable_naming&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;New pattern in v3.1.0. Detects vocabulary-clean code with zero semantic content: single-letter parameter floods, sequential numbered variables.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AM/GM gap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Core refinement in v3.1.0. The calibrator used an arithmetic mean (simpler approximation); the scorer uses a geometric mean (the precise target formula). Aligning them closes a ~5-7pt estimation gap on uneven files.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Quick context
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI SLOP Detector&lt;/a&gt; is a static analyzer that measures structural code quality — not style, not formatting. It scores each file across four dimensions and assigns a &lt;code&gt;deficit&lt;/code&gt; between 0 (clean) and 100 (critical):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LDR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Ratio of executable logic to total lines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Inflation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Jargon, docstring bloat, unsupported claims&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DDC&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unreachable paths, copy-pasted blocks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Purity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern hit rate (stubs, god functions, etc.)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These four numbers feed a single formula — a weighted geometric mean — called the &lt;strong&gt;GQG&lt;/strong&gt;. The output is the deficit score: &lt;code&gt;100 × (1 - GQG)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The calibrator's job is to find the best weights for that formula by searching over thousands of known cases.&lt;/p&gt;
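&lt;p&gt;For orientation, the GQG pipeline can be sketched in a few lines. The dimension scores and equal weights below are made up for illustration; the calibrated weights ship with the tool:&lt;/p&gt;

```python
import math

def gqg(scores: dict, weights: dict) -> float:
    # Weighted geometric mean computed in log space, floored at 1e-4
    total_w = sum(weights.values())
    log_sum = sum(w * math.log(max(1e-4, scores[k])) for k, w in weights.items())
    return math.exp(log_sum / total_w)

scores  = {"ldr": 0.9, "inflation_q": 0.6, "ddc": 0.8, "purity": 0.7}
weights = {"ldr": 1.0, "inflation_q": 1.0, "ddc": 1.0, "purity": 1.0}

deficit = min(100.0, 100.0 * (1.0 - gqg(scores, weights)))
print(round(deficit, 1))  # one weak dimension drags the whole product down
```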




&lt;h2&gt;
  
  
  Before v3.1.0: the self-scan
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrv0jxl1j4mnk2q1bnk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fefrv0jxl1j4mnk2q1bnk.png" alt="3critical blind spots discovered" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We don't ship a version without running the detector against itself. Before cutting v3.1.0, we ran v3.0.3 — a structural debt reduction pass on the three highest-deficit files in the codebase.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Self-scan before: avg_deficit=23.57, 15 deficit files, status=suspicious
Self-scan after:  avg_deficit=20.33, 12 deficit files, status=clean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;analysis/cross_file.py&lt;/code&gt; dropped from 70.3 to 28.7 (critical → clean). &lt;code&gt;ci_gate.py&lt;/code&gt; from 69.3 to 22.3. &lt;code&gt;cli.py&lt;/code&gt; from 68.4 to 20.9. The fixes were mechanical: extracted nested closures to private methods, replaced &lt;code&gt;if/elif/else&lt;/code&gt; dispatch chains with dict dispatch, removed re-declared constants.&lt;/p&gt;
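&lt;p&gt;The dict-dispatch change is the kind of mechanical fix worth showing. This is a generic before-shape/after-shape sketch, not the actual &lt;code&gt;cli.py&lt;/code&gt; code:&lt;/p&gt;

```python
# Handlers are stand-ins for real subcommand implementations
def handle_scan(args):   return "scan"
def handle_report(args): return "report"
def handle_gate(args):   return "gate"

# One flat lookup table replaces an if/elif/else chain,
# so adding a command no longer raises cyclomatic complexity
COMMANDS = {"scan": handle_scan, "report": handle_report, "gate": handle_gate}

def dispatch(command, args=None):
    handler = COMMANDS.get(command)
    if handler is None:
        raise SystemExit(f"unknown command: {command}")
    return handler(args)

print(dispatch("scan"))
```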

&lt;p&gt;The point is not that these numbers are good. It's that the tool had to earn its own PASS before we shipped the version that refines the formula. Shipping a scoring engine while your own codebase sits at &lt;code&gt;suspicious&lt;/code&gt; would have been its own kind of slop.&lt;/p&gt;




&lt;h2&gt;
  
  
  The adversarial tester: fhval SPAR
&lt;/h2&gt;

&lt;p&gt;In a &lt;a href="https://dev.to/flamehaven01/i-built-an-ecosystem-of-46-ai-assisted-repos-then-i-realized-it-might-be-eating-itself-46ni"&gt;previous post&lt;/a&gt; we described &lt;code&gt;fhval&lt;/code&gt; — flamehaven-validator. The core concern: when every tool in an ecosystem is built by the same person against the same baseline, internal consistency can masquerade as correctness. Passing your own tests proves nothing about whether your tests are asking the right questions.&lt;/p&gt;

&lt;p&gt;For v3.1.0 we added a &lt;code&gt;spar&lt;/code&gt; subcommand — an adversarial regression loop that interrogates the scorer from the outside. Running SPAR against the v3.0.x scorer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPAR score: 55 / 100  [FAIL]

Layer A anomalies:
  A3 stub_class_8_methods     expected &amp;gt;= 30  got 20.0  [ANOMALY]
  A4 fragmented_god_function  expected &amp;gt;= 10  got  0.0  [ANOMALY]
  A5 vocab_clean_meaningless  expected &amp;gt;=  8  got  0.0  [ANOMALY]

Layer C blind spots:
  C2 inflation_blindspot      [BLIND_SPOT]
  C3 ddc_annotation_gap       [BLIND_SPOT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three gaps. Two documented scope limits. Score: 55 FAIL.&lt;/p&gt;

&lt;p&gt;Each gap pointed at a specific detection weakness. The SPAR methodology itself — how Layer A/B/C work, why adversarial ground truth is hard to author from inside the codebase — is a separate topic covered in tomorrow's post. Here we focus on what the gaps told us and what we changed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refinement 1: The calibrator and scorer were using different formulas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ye6lvz5fzouhj62ksxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ye6lvz5fzouhj62ksxv.png" alt="the geometric mean" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The scorer computes a &lt;strong&gt;weighted geometric mean&lt;/strong&gt;. The calibrator — which finds optimal weights — was computing a &lt;strong&gt;weighted arithmetic mean&lt;/strong&gt; as its optimization target.&lt;/p&gt;

&lt;p&gt;Those are not the same thing, and for a quality gate, the difference is structural.&lt;/p&gt;

&lt;p&gt;Consider a file with three dimension scores: LDR=0.9 (good), inflation_quality=0.1 (very bad), DDC=0.8 (good).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Formula&lt;/th&gt;
&lt;th&gt;Calculation&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Deficit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Arithmetic mean&lt;/td&gt;
&lt;td&gt;(0.9 + 0.1 + 0.8) / 3&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Geometric mean&lt;/td&gt;
&lt;td&gt;(0.9 × 0.1 × 0.8) ^ (1/3)&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The arithmetic mean gives deficit=40. The geometric mean gives deficit=58. The gap is 18 points — not rounding, but structural. The geometric mean &lt;strong&gt;amplifies weak dimensions&lt;/strong&gt; because one bad score pulls the entire product down. The arithmetic mean averages over them.&lt;/p&gt;

&lt;p&gt;The scorer uses the geometric mean for good reason: a file with excellent LDR but zero actual logic (all docstrings) should not score deficit=30. It should score much higher. The formula enforces that.&lt;/p&gt;

&lt;p&gt;The first-generation calibrator used an arithmetic mean as a simpler starting approximation. So it was finding weights that minimize error against a different objective than the scorer actually computes. The result: roughly 5–7 point underestimation on files with uneven dimension profiles — which are precisely the target of this tool.&lt;/p&gt;

&lt;p&gt;The AM ≥ GM inequality means the calibrator's scores were always optimistic. For balanced files (all dimensions similar) the gap is small and harmless. For uneven files, it was systematic — and those are the cases that matter most.&lt;/p&gt;
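&lt;p&gt;The gap in the table is reproducible in a few lines:&lt;/p&gt;

```python
scores = [0.9, 0.1, 0.8]

am = sum(scores) / len(scores)                       # arithmetic mean
gm = (scores[0] * scores[1] * scores[2]) ** (1 / 3)  # geometric mean

deficit_am = 100 * (1 - am)
deficit_gm = 100 * (1 - gm)
print(round(deficit_am), round(deficit_gm))  # 40 58
```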

&lt;p&gt;&lt;strong&gt;Refinement:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before (calibrator _recompute_deficit)
&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w_ldr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ldr&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_inflation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inflation_n&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_ddc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ddc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt;

&lt;span class="c1"&gt;# After — mirrors the scorer's GQG formula exactly
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;log&lt;/span&gt;
&lt;span class="n"&gt;gqg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;w_ldr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ldr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_inflation&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;inflation_n&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;w_ddc&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1e-4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ddc&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_w&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;deficit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;100.0&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;gqg&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is why SPAR anomaly A3 (&lt;code&gt;stub_class_8_methods&lt;/code&gt;) jumped from deficit 20.0 to 40.0: the stub class had heavily uneven dimensions, and the geometric mean scored it correctly once the calibrator was trained against the right target.&lt;/p&gt;




&lt;h2&gt;
  
  
  Refinement 2: The complexity modifier had a dead zone at the common end
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyfwsfopk911dg0uec96.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmyfwsfopk911dg0uec96.png" alt="docstring bloat" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The inflation metric applies a complexity modifier to penalize functions that are simultaneously simple and jargon-heavy — a common pattern in AI-generated code: a two-line function surrounded by an elaborate docstring.&lt;/p&gt;

&lt;p&gt;The first-generation modifier formula:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;complexity_modifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_complexity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For CC=1: &lt;code&gt;1.0 + (1-3)/10 = 0.8&lt;/code&gt; → &lt;code&gt;max(1.0, 0.8)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;br&gt;
For CC=2: &lt;code&gt;1.0 + (2-3)/10 = 0.9&lt;/code&gt; → &lt;code&gt;max(1.0, 0.9)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;br&gt;
For CC=3: &lt;code&gt;1.0 + (3-3)/10 = 1.0&lt;/code&gt; → &lt;code&gt;max(1.0, 1.0)&lt;/code&gt; = &lt;strong&gt;1.0&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;CC=1, 2, and 3 all received the same modifier: 1.0. This meant simple functions — the three most common complexity levels — paid no complexity premium on inflation, regardless of how jargon-heavy they were. The modifier only activated from CC=4 upward.&lt;/p&gt;

&lt;p&gt;Simple, jargon-heavy functions are the most common AI code signature. The formula was least sensitive precisely where it needed to be most sensitive.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After — CC=1 is the baseline, not CC=3
&lt;/span&gt;&lt;span class="n"&gt;complexity_modifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_complexity&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now CC=2 gets a 1.10× modifier, CC=3 gets 1.20×. The penalty scales from the simplest meaningful function upward.&lt;/p&gt;
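The dead zone and its fix can be put side by side in a minimal sketch, using the before/after formulas above:

```python
def modifier_before(cc: float) -> float:
    # old formula: CC=3 baseline creates a dead zone at CC=1..3
    return max(1.0, 1.0 + (cc - 3.0) / 10.0)

def modifier_after(cc: float) -> float:
    # new formula: CC=1 baseline, so the penalty scales from the simplest function up
    return max(1.0, 1.0 + (cc - 1.0) / 10.0)

for cc in (1, 2, 3, 4):
    print(cc, modifier_before(cc), modifier_after(cc))
```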




&lt;h2&gt;
  
  
  Refinement 3: Purity weight was documented but not connected
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1d144fslfbtceumfn9o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1d144fslfbtceumfn9o.png" alt="catching stub piplelines and placeholder variable floods" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The GQG formula includes a purity dimension:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Before
&lt;/span&gt;&lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;  &lt;span class="c1"&gt;# hardcoded constant
&lt;/span&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gqg_score&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;purity_penalty&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;.slopconfig.yaml&lt;/code&gt; had a &lt;code&gt;weights.purity&lt;/code&gt; field. The calibrator's weight search had a purity parameter. Neither was connected to this constant — users could configure &lt;code&gt;weights.purity: 0.20&lt;/code&gt; and nothing would change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# After
&lt;/span&gt;&lt;span class="n"&gt;w_pur&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;purity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# default unchanged; now configurable
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line. The config surface now matches the implementation.&lt;/p&gt;
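For completeness, a hypothetical `.slopconfig.yaml` fragment; only the `weights.purity` key is confirmed above, and the layout of sibling keys is illustrative:

```yaml
# .slopconfig.yaml (fragment; sibling keys omitted)
weights:
  purity: 0.20   # read by the scorer as of this release; defaults to 0.10 if omitted
```

With the fix, `weights.get("purity", 0.10)` picks this value up; before, the hardcoded constant always won.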




&lt;h2&gt;
  
  
  Two new detection patterns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Stub evasion: empty container returns
&lt;/h3&gt;

&lt;p&gt;The existing &lt;code&gt;return_constant_stub&lt;/code&gt; pattern caught &lt;code&gt;return True&lt;/code&gt;, &lt;code&gt;return 0&lt;/code&gt;, &lt;code&gt;return "string"&lt;/code&gt; — but not &lt;code&gt;return {}&lt;/code&gt;, &lt;code&gt;return []&lt;/code&gt;, &lt;code&gt;return ()&lt;/code&gt;, &lt;code&gt;return set()&lt;/code&gt;. These are equally common stub patterns in class skeletons:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;DataProcessor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;  &lt;span class="c1"&gt;# was not flagged before
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;list_items&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# was not flagged before
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both are now caught by &lt;code&gt;return_constant_stub&lt;/code&gt; and &lt;code&gt;interface_only_class&lt;/code&gt;.&lt;/p&gt;
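A minimal sketch of what such a check reduces to at the AST level. This is an illustration of the idea, not the real `return_constant_stub` implementation:

```python
import ast

def is_empty_container_return(stmt: ast.stmt) -> bool:
    """Heuristic: does this statement return an empty container literal or constructor?"""
    if not isinstance(stmt, ast.Return) or stmt.value is None:
        return False
    v = stmt.value
    if isinstance(v, ast.Dict):
        return not v.keys                                 # return {}
    if isinstance(v, (ast.List, ast.Tuple)):
        return not v.elts                                 # return [] / return ()
    if isinstance(v, ast.Call) and isinstance(v.func, ast.Name):
        # return set() / dict() / list() / tuple() / frozenset()
        return (v.func.id in {"set", "dict", "list", "tuple", "frozenset"}
                and not v.args and not v.keywords)
    return False

fn = ast.parse("def get_results(self) -> dict:\n    return {}\n").body[0]
print(is_empty_container_return(fn.body[0]))   # True
```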




&lt;h3&gt;
  
  
  Fragmented god function: AST clone detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimenh3uabx0lodo1yqs0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fimenh3uabx0lodo1yqs0.png" alt="AST Jensen-shannon divergence" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SPAR anomaly A4 was a file with 12 one-liner helper functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.2&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_compute_r3&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;1.3&lt;/span&gt;
&lt;span class="c1"&gt;# ... through r12
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function individually looks clean: low complexity, no nesting, short. No single function exceeds any per-function threshold. But collectively, this is a decomposed god function — a large computation split into structurally identical fragments that evade per-function gates.&lt;/p&gt;

&lt;p&gt;The new pattern: &lt;code&gt;function_clone_cluster&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works.&lt;/strong&gt; For each file, build a 30-dimensional histogram of AST node types for every function: how many &lt;code&gt;If&lt;/code&gt; nodes, &lt;code&gt;Return&lt;/code&gt; nodes, &lt;code&gt;Call&lt;/code&gt; nodes, &lt;code&gt;BinOp&lt;/code&gt; nodes, and so on. The histogram is normalized to a probability distribution. Then compute the pairwise Jensen-Shannon divergence (JSD) between all function pairs. With base-2 logarithms, JSD is bounded between 0 and 1. Two functions with near-identical AST structure produce a JSD close to 0.&lt;/p&gt;

&lt;p&gt;Functions with JSD &amp;lt; 0.05 get an edge in a graph. BFS finds connected components. The largest component is the clone cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Thresholds:
  &amp;gt;= 6 functions in cluster: CRITICAL
  &amp;gt;= 4 functions in cluster: HIGH
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why JSD and not simpler metrics.&lt;/strong&gt; Neither cosine similarity nor Euclidean distance on raw histograms handles sparse distributions well — short functions have mostly empty histograms, and small absolute differences dominate. JSD compares distributions rather than raw vectors and stays stable when most histogram dimensions are near zero. It is also bounded above by 1, which makes the 0.05 threshold interpretable rather than dataset-dependent.&lt;/p&gt;
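The histogram-JSD-BFS pipeline described above can be sketched end to end in a few dozen lines. This version counts all node types rather than the fixed 30 dimensions the tool uses, and is an illustration rather than the shipped detector:

```python
import ast
import math
from collections import Counter, deque
from itertools import combinations

def node_histogram(func: ast.FunctionDef) -> dict:
    """Normalized histogram of AST node types inside one function."""
    counts = Counter(type(n).__name__ for n in ast.walk(func))
    total = sum(counts.values())
    return {name: c / total for name, c in counts.items()}

def jsd(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence with base-2 logs: 0 = identical, 1 = disjoint."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}
    def kl(a):
        return sum(a[k] * math.log2(a[k] / m[k]) for k in keys if a.get(k, 0.0) > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def clone_clusters(source: str, threshold: float = 0.05) -> list:
    """Connected components (via BFS) of functions whose pairwise JSD is below threshold."""
    funcs = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.FunctionDef)]
    hists = [node_histogram(f) for f in funcs]
    adj = {i: set() for i in range(len(funcs))}
    for i, j in combinations(range(len(funcs)), 2):
        if jsd(hists[i], hists[j]) < threshold:
            adj[i].add(j)
            adj[j].add(i)
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            n = queue.popleft()
            if n in comp:
                continue
            comp.add(n)
            seen.add(n)
            queue.extend(adj[n] - comp)
        if len(comp) > 1:
            clusters.append(sorted(funcs[i].name for i in comp))
    return clusters

src = "\n".join(f"def _compute_r{i}(x): return x * 1.{i}" for i in range(1, 7))
print(clone_clusters(src))   # one cluster containing all six structurally identical functions
```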

&lt;p&gt;The JSD threshold (0.05) was calibrated against the internal test corpus. It will produce false positives on files with many similar utility functions — for example, a large set of &lt;code&gt;_validate_field_X()&lt;/code&gt; validators that are structurally identical by design. Adjust via &lt;code&gt;--config&lt;/code&gt; if needed.&lt;/p&gt;




&lt;h3&gt;
  
  
  Placeholder variable naming (v1.0)
&lt;/h3&gt;

&lt;p&gt;SPAR anomaly A5 was vocabulary-clean code with zero semantic content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;aggregate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;g&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;
    &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;
    &lt;span class="n"&gt;r3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r2&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;
    &lt;span class="c1"&gt;# ... through r12
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No buzzwords. No docstring bloat. Every traditional linter passes this. The new &lt;code&gt;placeholder_variable_naming&lt;/code&gt; pattern applies two checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Single-letter parameter density&lt;/strong&gt;: 5 or more single-letter parameters (excluding &lt;code&gt;self&lt;/code&gt;, &lt;code&gt;cls&lt;/code&gt;, &lt;code&gt;_&lt;/code&gt;) → HIGH.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sequential numbered variables&lt;/strong&gt;: a run of 8 or more → HIGH; 4 or more → MEDIUM.&lt;/li&gt;
&lt;/ol&gt;
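The two checks above can be sketched like this. Thresholds come from the description; the function names, return shape, and run-detection internals are illustrative, not the real pattern code:

```python
import ast
import re

def longest_consecutive(nums: set) -> int:
    """Length of the longest run of consecutive integers in nums."""
    best = 0
    for n in nums:
        if n - 1 not in nums:          # n starts a run
            length = 1
            while n + length in nums:
                length += 1
            best = max(best, length)
    return best

def placeholder_findings(source: str) -> list:
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.FunctionDef):
            continue
        # Check 1: single-letter parameter density (self/cls/_ excluded)
        params = [a.arg for a in node.args.args if a.arg not in ("self", "cls", "_")]
        if sum(len(p) == 1 for p in params) >= 5:
            findings.append((node.name, "single_letter_params", "HIGH"))
        # Check 2: runs of sequentially numbered assigned variables (r1, r2, ...)
        assigned = {t.id for t in ast.walk(node)
                    if isinstance(t, ast.Name) and isinstance(t.ctx, ast.Store)}
        runs = {}
        for name in assigned:
            m = re.fullmatch(r"([A-Za-z_]+?)(\d+)", name)
            if m:
                runs.setdefault(m.group(1), set()).add(int(m.group(2)))
        for prefix, nums in runs.items():
            run = longest_consecutive(nums)
            if run >= 8:
                findings.append((node.name, f"numbered_vars_{prefix}", "HIGH"))
            elif run >= 4:
                findings.append((node.name, f"numbered_vars_{prefix}", "MEDIUM"))
    return findings

src = "def aggregate(a, b, c, d, e, f, g):\n    r1 = a + b\n    r2 = r1 * c\n    r3 = r2 - d\n    r4 = r3 + e\n    return r4\n"
print(placeholder_findings(src))
```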

&lt;p&gt;This is v1.0: it detects naming &lt;em&gt;style&lt;/em&gt;, not semantic quality. Known false positive zone: scientific and math libraries legitimately use single-letter conventions (&lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;, &lt;code&gt;z&lt;/code&gt;, &lt;code&gt;mu&lt;/code&gt;, &lt;code&gt;sigma&lt;/code&gt;). Suppress with &lt;code&gt;domain_overrides&lt;/code&gt; in &lt;code&gt;.slopconfig.yaml&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  SPAR result after v3.1.0
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SPAR score: 85 / 100  [PASS]

Layer A: 5/5 anchors consistent
Layer B: 4 documented limitations (no regressions)
Layer C: C2 inflation_blindspot [BLIND_SPOT — known scope limit]
         C3 ddc_annotation_gap  [BLIND_SPOT — known scope limit]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;55 → 85 PASS.&lt;/p&gt;

&lt;p&gt;The two remaining blind spots are not gaps to close — they're the documented scope limits of static analysis: a tool that reads AST cannot determine whether arithmetic is semantically meaningful, or whether annotation-heavy imports serve a real runtime purpose. Those require a different class of model. Documenting the ceiling is part of the job.&lt;/p&gt;

&lt;p&gt;The full SPAR methodology — how Layer A/B/C work, why Layer A ground truth is hard to author from inside the codebase, and what "validating the validator" means in practice — is covered in tomorrow's post.&lt;/p&gt;




&lt;h2&gt;
  
  
  v3.1.1: the self-inspection patch
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5ynlx4hh85jso5rqdq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9r5ynlx4hh85jso5rqdq.png" alt="dog food" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;v3.1.0 and v3.1.1 shipped on the same day. The clone detection pattern introduced in v3.1.0 had a visibility gap: &lt;code&gt;function_clone_cluster&lt;/code&gt; fired in the Issues section but produced no signal in the Core Metrics table. A community issue caught it within hours.&lt;/p&gt;

&lt;p&gt;But before cutting v3.1.1, we ran the tool against itself — and the new patterns found something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;placeholder.py    deficit: 70.3  [CRITICAL_DEFICIT]
python_advanced.py  deficit: 74.0  [CRITICAL_DEFICIT]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both files are part of the detection engine itself. Root cause: &lt;code&gt;check_node&lt;/code&gt; methods with cyclomatic complexity 20–31, caused by compound boolean logic that had accumulated across releases. The tool was flagging its own pattern implementations as having the exact complexity problems it was designed to detect.&lt;/p&gt;

&lt;p&gt;We extracted four module-level helpers in &lt;code&gt;placeholder.py&lt;/code&gt; (&lt;code&gt;_strip_docstring&lt;/code&gt;, &lt;code&gt;_has_abstractmethod&lt;/code&gt;, &lt;code&gt;_empty_container_repr&lt;/code&gt;, &lt;code&gt;_is_placeholder_stmt&lt;/code&gt;) and added &lt;code&gt;_make_god_issue()&lt;/code&gt; and &lt;code&gt;_collect_numbered_vars()&lt;/code&gt; to &lt;code&gt;python_advanced.py&lt;/code&gt;. Each &lt;code&gt;check_node&lt;/code&gt; method went from 20–70 lines to 8–15. The detector earned its own PASS before shipping the patch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;placeholder.py      70.3 → 43.7  [CRITICAL → SUSPICIOUS]
python_advanced.py  74.0 → 66.7  [CRITICAL → INFLATED_SIGNAL]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional v3.1.1 refinements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clone Detection row&lt;/strong&gt; added to Core Metrics table (CRITICAL/PASS at a glance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Table style unified&lt;/strong&gt; to &lt;code&gt;box.ROUNDED&lt;/code&gt; across all project output (was mixing three styles).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VS Code extension&lt;/strong&gt;: &lt;code&gt;extractJson()&lt;/code&gt; strips &lt;code&gt;[INFO]&lt;/code&gt; log lines before &lt;code&gt;JSON.parse&lt;/code&gt; — previously caused silent parse failures when CLI log output appeared alongside JSON. Workspace analysis replaced with a QuickPick list of deficit files sorted by score; clicking opens the file in the editor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you installed 3.1.0, upgrade to 3.1.1 before using clone detection in CI.&lt;/p&gt;




&lt;h2&gt;
  
  
  How this fits alongside existing tools
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl4ag50tf34mrhioj0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbl4ag50tf34mrhioj0q.png" alt="compare" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;What it sees that others don't&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semgrep&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Pattern-matching on AST&lt;/td&gt;
&lt;td&gt;Rule violations you've pre-authored&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SonarQube&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cognitive complexity, duplication, coverage&lt;/td&gt;
&lt;td&gt;Complexity, coverage gaps — not structural emptiness&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Radon&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cyclomatic complexity&lt;/td&gt;
&lt;td&gt;Raw CC values; used internally by AI SLOP Detector&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bandit&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Security rules&lt;/td&gt;
&lt;td&gt;Security vulnerabilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;mutmut / cosmic-ray&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Mutation testing&lt;/td&gt;
&lt;td&gt;Whether your test suite catches real bugs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI SLOP Detector&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Metric-based structural analysis&lt;/td&gt;
&lt;td&gt;Docstring theater, stub pipelines, fragmented logic, phantom imports&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key gap: a file can be fully SonarQube-clean while containing zero actual logic — all stubs, all docstrings, all type annotations. Cognitive complexity doesn't measure whether the complexity is real. LDR does. Inflation does.&lt;/p&gt;

&lt;p&gt;The complementary tool here is mutation testing. SPAR tests whether the scorer measures what it claims. Mutation testing tests whether your tests catch what they claim to catch. Both are adversarial approaches to the meta-problem: how do you validate the validator?&lt;/p&gt;




&lt;h2&gt;
  
  
  Score evolution
&lt;/h2&gt;

&lt;p&gt;If you're running AI SLOP Detector on an existing project, upgrading to 3.1.x will change your scores. The formula alignment in Refinement 1 increases deficit on files with uneven dimension profiles, typically by 3–8 points. This is not drift — it's the scorer becoming more precise in the region where it matters most. Files that were borderline &lt;code&gt;suspicious&lt;/code&gt; may move into &lt;code&gt;inflated_signal&lt;/code&gt;. Check your CI threshold after upgrading.&lt;/p&gt;

&lt;p&gt;Previous scores were valid estimates produced by the first-generation model. v3.1.x scores are tighter estimates with better sensitivity where dimensions are uneven — which is precisely the profile of AI-generated code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Honest limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;function_clone_cluster&lt;/code&gt; threshold (JSD &amp;lt; 0.05) was calibrated against the internal test corpus. It will fire false positives on legitimate utility function clusters. Adjust via &lt;code&gt;--config&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;placeholder_variable_naming&lt;/code&gt; v1.0 has no semantic context. &lt;code&gt;def distance(x, y, z)&lt;/code&gt; is legitimate; the pattern doesn't know that.&lt;/li&gt;
&lt;li&gt;SPAR score 85 means five ground truth anchors pass and eight of ten Layer C probes hold. The space of evasion patterns is open-ended. More in tomorrow's SPAR post.&lt;/li&gt;
&lt;li&gt;The Layer A corpus is internally authored. External adversarial contributions would make it stronger.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Install / upgrade
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;ai-slop-detector&lt;span class="o"&gt;==&lt;/span&gt;3.1.1
&lt;span class="c"&gt;# or&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; ai-slop-detector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;VS Code extension: search &lt;strong&gt;"AI SLOP Detector"&lt;/strong&gt; in Extensions, or install from VSIX:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;code &lt;span class="nt"&gt;--install-extension&lt;/span&gt; vscode-slop-detector-3.1.1.vsix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Scan a project&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; ./your-project

&lt;span class="c"&gt;# Machine-readable output&lt;/span&gt;
slop-detector &lt;span class="nt"&gt;--project&lt;/span&gt; ./your-project &lt;span class="nt"&gt;--json&lt;/span&gt; | jq &lt;span class="s1"&gt;'.file_results[] | {file: .file_path, deficit: .deficit_score}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;GitHub: &lt;a href="https://github.com/flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;flamehaven01/AI-SLOP-Detector&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Previous posts in this series:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/the-tool-that-turned-on-itself-ai-slop-detector-v290-v291-3oc4"&gt;v2.9.0/v2.9.1: The Tool That Turned On Itself&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v270-why-we-built-a-linter-we-actually-use-2nb6"&gt;v2.7.0: Why We Built a Linter We Actually Use&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/ai-slop-detector-v263-is-live-on-vs-code-3oj4"&gt;v2.6.3: Now on VS Code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/flamehaven01/i-built-an-ecosystem-of-46-ai-assisted-repos-then-i-realized-it-might-be-eating-itself-46ni"&gt;fhval: I Built an Ecosystem of 46 AI-Assisted Repos. Then I Realized It Might Be Eating Itself.&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>opensource</category>
      <category>python</category>
      <category>architecture</category>
      <category>devtool</category>
    </item>
    <item>
      <title>My AI Maintainer Kept Making Wrong Calls. So I Made It Report Its State Before Touching Anything.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Tue, 07 Apr 2026 17:24:43 +0000</pubDate>
      <link>https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7</link>
      <guid>https://dev.to/flamehaven01/my-ai-maintainer-kept-making-wrong-calls-so-i-made-it-report-its-state-before-touching-anything-2df7</guid>
      <description>&lt;h2&gt;
  
  
  🔎 Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;memory_injection&lt;/strong&gt;: A MICA operational mode. The archive is updated after each maintenance session and read by the next AI session to compensate for session amnesia.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening output the model must produce at session start — declaring archive version, self-test result, drift status, and active invariants — before any repository-level work begins.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state: file existence, hash integrity, and invocation protocol presence.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Drift Response Policy&lt;/strong&gt;: The schema-level declaration of how hash mismatches and missing files are handled. Different failure classes carry different response actions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant&lt;/strong&gt;: A structured governance rule with identity, severity, and statement. Not a style guideline. A constraint the model cannot rationalize past.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Provenance Registry&lt;/strong&gt;: The record of tracked files with SHA256 hashes. The basis for drift detection.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Deviation Log&lt;/strong&gt;: The audit trail of formal exceptions to design invariants. Empty means no exceptions have been logged — not that no judgment calls were made.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d3fkvj8ciwgp3p6u1eu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6d3fkvj8ciwgp3p6u1eu.png" alt="coverimage" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What Part 5 Left Open
&lt;/h2&gt;

&lt;p&gt;Part 5 placed MICA inside the context engineering landscape and drew one boundary: MICA is not a collection system. It begins after collection ends. Its job is to govern what enters the session, what remains authoritative, and how the model proves it actually loaded the governed archive at all.&lt;/p&gt;

&lt;p&gt;That answer was correct. It was also still abstract.&lt;/p&gt;

&lt;p&gt;This post comes down from that framing. It shows what MICA looks like when it is actually running — not as a concept, but as a protocol inside a real project.&lt;/p&gt;

&lt;p&gt;The project is &lt;code&gt;flamehaven.space&lt;/code&gt;, a Next.js B2B site maintained by a solo operator using a MICA package in &lt;code&gt;memory_injection&lt;/code&gt; mode. Everything shown here is drawn from the live archive. Values that would expose internal configuration are anonymized; structure and behavior are real.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The Session Opening Report
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekhdvby8iwqn2iby9j2u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fekhdvby8iwqn2iby9j2u.png" alt="The paradigm shift" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every MICA session in &lt;code&gt;memory_injection&lt;/code&gt; mode begins with a declared output before any work starts. The archive specifies the required format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: flamehaven-space-maintainer v0.2.0
Self-test: PASS (3 checks -- ST-001, ST-002, ST-003)
Drift: no hash mismatch detected
Active invariants: DI-001 (critical), DI-002 (critical) + 24 others loaded
Gate: PASS -- proceeding with maintenance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is not a courtesy summary. It is a gate. The archive field &lt;code&gt;session_report_format.gate_block_on&lt;/code&gt; is set to &lt;code&gt;critical_self_test_failure&lt;/code&gt; — meaning the model must declare its load state before it is permitted to make any repository-level changes.&lt;/p&gt;

&lt;p&gt;The format is specified in the archive itself:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;json
"session_report_format": {
  "trigger": "session_start",
  "required_fields": ["archive_version", "self_test", "drift_status", "active_invariants", "gate"],
  "format_template": "[SESSION READY]\nArchive: {archive_version}\nSelf-test: {self_test}\nDrift: {drift_status}\nActive invariants: {active_invariants}\nGate: {gate}"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model does not decide what to declare. The archive tells it what a valid session opening looks like.&lt;/p&gt;
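Mechanically, the opening report is just that template rendered with the five required fields. A minimal sketch, with field values taken from the sample report above:

```python
# The format_template string, as declared in the archive's session_report_format.
template = ("[SESSION READY]\nArchive: {archive_version}\nSelf-test: {self_test}\n"
            "Drift: {drift_status}\nActive invariants: {active_invariants}\nGate: {gate}")

report = template.format(
    archive_version="flamehaven-space-maintainer v0.2.0",
    self_test="PASS (3 checks -- ST-001, ST-002, ST-003)",
    drift_status="no hash mismatch detected",
    active_invariants="DI-001 (critical), DI-002 (critical) + 24 others loaded",
    gate="PASS -- proceeding with maintenance",
)
print(report)
```

Because the template lives in the archive rather than in the prompt, a missing field is a format violation the operator can spot immediately.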




&lt;h2&gt;
  
  
  3. What the Self-Test Actually Checks
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4lxbrkjs3mhf8o3vbrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu4lxbrkjs3mhf8o3vbrh.png" alt="self-test mechanics" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;self_test_policy&lt;/code&gt; runs on &lt;code&gt;session_start&lt;/code&gt; and &lt;code&gt;pre_handoff&lt;/code&gt;. Three checks matter here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ST-001&lt;/strong&gt; (&lt;code&gt;provenance_sha256_format&lt;/code&gt;, severity: &lt;code&gt;error&lt;/code&gt;) — verifies that provenance hashes in the registry match the expected format. A malformed hash means the file fingerprint is untrustworthy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ST-002&lt;/strong&gt; (&lt;code&gt;provenance_file_exists&lt;/code&gt;, severity: &lt;code&gt;warning&lt;/code&gt;) — verifies that files listed in the provenance registry actually exist on disk. A missing file is not a formatting error; it is a ghost reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ST-003&lt;/strong&gt; (&lt;code&gt;invocation_pattern_present&lt;/code&gt;, severity: &lt;code&gt;error&lt;/code&gt;) — verifies that the invocation protocol is declared and readable. If the model cannot confirm how it was loaded, it cannot confirm the session is in a governed state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Failure behavior is set per-check. The overall &lt;code&gt;on_failure&lt;/code&gt; policy for this archive is &lt;code&gt;warn_continue&lt;/code&gt; — the session proceeds, but the failure is surfaced explicitly in the opening report.&lt;/p&gt;

&lt;p&gt;This is a deliberate calibration. A site maintenance session that blocks hard on every provenance warning would be too brittle for solo operation. The severity ladder reflects the actual cost of each failure mode, not a theoretical maximum.&lt;/p&gt;
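&lt;p&gt;A minimal sketch of the three checks, assuming a registry that maps file paths to recorded hashes. The check IDs, severities, and the &lt;code&gt;warn_continue&lt;/code&gt; policy are from the archive; the data layout and helper are illustrative:&lt;/p&gt;

```python
# Illustrative self-test: ST-001 validates hash format, ST-002 validates
# file existence, ST-003 validates the invocation protocol. Under the
# warn_continue policy, failures are surfaced but never abort the session.
import os
import re

SHA256_RE = re.compile(r"^[0-9a-f]{64}$")

def run_self_test(registry, invocation_pattern):
    """registry maps file path to its recorded sha256 hash."""
    failures = []
    for path, digest in registry.items():
        if not SHA256_RE.match(digest):                 # ST-001 (error)
            failures.append(("ST-001", "error", path))
        if not os.path.exists(path):                    # ST-002 (warning)
            failures.append(("ST-002", "warning", path))
    if not invocation_pattern:                          # ST-003 (error)
        failures.append(("ST-003", "error", "invocation protocol missing"))
    status = "PASS" if not failures else "PASS with warnings"
    return status, failures

# A well-formed hash pointing at a file that does not exist: ST-002 fires
# as a warning and the session continues.
status, failures = run_self_test(
    {"ghost.md": "ab" * 32},
    invocation_pattern="mica-load v0.2.0",
)
print(status, failures)
```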




&lt;h2&gt;
  
  
  4. What Drift Detection Looks Like in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye82pmwyaq0ek7d7qp2k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye82pmwyaq0ek7d7qp2k.png" alt="drift response policy" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive's &lt;code&gt;drift_response_policy&lt;/code&gt; is minimal by design:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"drift_response_policy"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_hash_mismatch"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warn_continue"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"on_file_missing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"warn_block"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reminder_after_change"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"inline_sync_required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The distinction between &lt;code&gt;warn_continue&lt;/code&gt; and &lt;code&gt;warn_block&lt;/code&gt; is operationally significant.&lt;/p&gt;

&lt;p&gt;A hash mismatch means a tracked file changed — which happens legitimately during ordinary maintenance. The model surfaces the mismatch and continues.&lt;/p&gt;

&lt;p&gt;A file that has gone missing entirely is a different failure class. The model blocks and requires operator acknowledgment before proceeding.&lt;/p&gt;
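&lt;p&gt;The dispatch between the two failure classes is small enough to sketch directly. The policy values are quoted from the archive; the dispatcher and its message strings are assumptions:&lt;/p&gt;

```python
# Sketch of the drift_response_policy dispatch: hash mismatches warn and
# continue, missing files block until the operator acknowledges them.
DRIFT_POLICY = {
    "on_hash_mismatch": "warn_continue",
    "on_file_missing": "warn_block",
}

def respond_to_drift(event, path):
    action = DRIFT_POLICY[event]
    if action == "warn_block":
        # different failure class: require operator acknowledgment
        return f"BLOCK: {path} missing, operator acknowledgment required"
    return f"WARN: {path} changed, continuing"

print(respond_to_drift("on_hash_mismatch", "next.config.ts"))
print(respond_to_drift("on_file_missing", "playbook.md"))
```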

&lt;p&gt;&lt;code&gt;reminder_after_change: true&lt;/code&gt; means the archive instructs the model to remind the operator to refresh the provenance registry and artifact manifest before minting the next archive version. This is not automated enforcement. It is memory injection: the archive tells the next session what the previous session should have reminded the operator to do.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deviation_log&lt;/code&gt; in v0.2.0 is empty. That is not a sign the system has never been used. It means no deviations have been formally logged yet — which is itself a state the archive captures.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. What Happens When a Deployment Changes Something
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focdk66if1kifuqc9znha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focdk66if1kifuqc9znha.png" alt="deployment evolution" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A concrete scenario: the operator ships a writing refresh mode change — switching from automatic ISR to manual operator-triggered revalidation. Three files change: &lt;code&gt;next.config.ts&lt;/code&gt;, a helper &lt;code&gt;.bat&lt;/code&gt; script, and the playbook.&lt;/p&gt;

&lt;p&gt;On the next session open:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Self-test runs. ST-002 may flag if the helper script path is not in the provenance registry yet.&lt;/li&gt;
&lt;li&gt;Drift check runs. Hash mismatches fire for the changed files. &lt;code&gt;on_hash_mismatch: warn_continue&lt;/code&gt; — the session proceeds.&lt;/li&gt;
&lt;li&gt;The opening report surfaces the mismatch:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Archive: flamehaven-space-maintainer v0.2.0
Self-test: PASS with warnings (ST-002: update-writing-now.bat not in provenance registry)
Drift: hash mismatch on next.config.ts, flamehaven-space-maintainer-playbook.v0.2.0.md
Active invariants: DI-001 (critical), DI-002 (critical) + 24 others
Gate: PASS WITH WARNINGS -- operator review required
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The operator now has a concrete decision surface before touching anything: what changed, what the model knows about, and what it does not.&lt;/p&gt;

&lt;p&gt;The model then follows the change process defined in the playbook: identify the canonical subsystem touched, patch the smallest coherent surface, run build and audit, verify route-level behavior, then update README, MICA, or playbook if the change alters maintainer truth.&lt;/p&gt;

&lt;p&gt;At the end of the session, if a new archive version is minted, the synchronization rule is explicit: file name, &lt;code&gt;project.version&lt;/code&gt;, and the archive handoff marker must be updated in the same change. Not sequentially. The same change.&lt;/p&gt;
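&lt;p&gt;The synchronization rule is mechanically checkable. A hedged sketch, assuming the version appears in the archive file name as &lt;code&gt;vX.Y.Z&lt;/code&gt; and that the handoff marker quotes it (both naming conventions are assumptions for illustration):&lt;/p&gt;

```python
# Sketch: verify that archive file name, project.version, and the handoff
# marker all agree before a new archive version is considered minted.
import re

def versions_synchronized(archive_filename, project_version, handoff_marker):
    m = re.search(r"v(\d+\.\d+\.\d+)", archive_filename)
    if m is None:
        return False
    file_version = m.group(1)
    return file_version == project_version and file_version in handoff_marker

# All three surfaces updated in the same change: consistent.
assert versions_synchronized(
    "flamehaven-space-maintainer.v0.3.0.json", "0.3.0", "HANDOFF v0.3.0")
# project.version lagging behind the file name: inconsistent.
assert not versions_synchronized(
    "flamehaven-space-maintainer.v0.3.0.json", "0.2.0", "HANDOFF v0.3.0")
```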

&lt;p&gt;This is the operational point of drift reporting: not merely to announce that something changed, but to force the model and the operator to see the same changed surface before any new work proceeds.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. What the Design Invariants Actually Govern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbq1ujtnozummxbd8pl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfbq1ujtnozummxbd8pl.png" alt="design invariants" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The archive carries 26 design invariants. The first six establish the perimeter everything else operates within:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DI-001&lt;/strong&gt; (critical): Flamehaven is positioned as a governance-first, founder-led B2B AI systems practice, not a generic agency or AI wrapper shop.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-002&lt;/strong&gt; (critical): Primary conversion surface is the main domain &lt;code&gt;flamehaven.space&lt;/code&gt;, not Medium, Substack, DEV.to, or LinkedIn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-003&lt;/strong&gt; (high): Writing detail pages are authoritative artifacts linked to projects, selected work, contact, and SEO canonical ownership.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-004&lt;/strong&gt; (high): Selected Work must distinguish public, private, and internal systems without broken public links or placeholder copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-005&lt;/strong&gt; (high): Legacy WordPress-era routes must redirect away from obsolete agency messaging.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DI-006&lt;/strong&gt; (high): Operational choices favor deterministic behavior, inspectability, and maintenance continuity over decorative complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not style guidelines. They are session-blocking constraints. An AI that proposes converting Selected Work to a live-fetch real-time surface is violating DI-006. An AI that treats cross-posting as the canonical publishing path is violating DI-002. An AI that leaves a legacy route live because a redirect “seems unnecessary” is violating DI-005.&lt;/p&gt;
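&lt;p&gt;The gating these examples imply can be sketched in a few lines. The invariant IDs and severities are from the list above; the block/warn logic is an assumption about how a session-blocking constraint would be enforced:&lt;/p&gt;

```python
# Minimal sketch of severity-gated invariant enforcement: violating any
# critical invariant blocks the session; high-severity violations warn.
INVARIANTS = {
    "DI-001": "critical",
    "DI-002": "critical",
    "DI-005": "high",
    "DI-006": "high",
}

def gate_proposal(violated_ids):
    severities = [INVARIANTS[i] for i in violated_ids]
    if "critical" in severities:
        return "BLOCK"
    if severities:
        return "WARN"
    return "PASS"

print(gate_proposal(["DI-002"]))  # cross-posting as canonical path: BLOCK
print(gate_proposal(["DI-006"]))  # decorative complexity: WARN
```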

&lt;p&gt;The invariants exist so the model cannot rationalize its way past the operator's architectural intent — even across sessions, even with a new model instance that has never seen the project before.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. What MICA Cannot Do Here
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tpmb7ohzye0el0dj8i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo7tpmb7ohzye0el0dj8i.png" alt="system limits and human authority" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;deviation_log&lt;/code&gt; is empty because no formal deviation has been logged. But there have been judgment calls.&lt;/p&gt;

&lt;p&gt;One example is the writing hero image fallback logic. The site had to support both modern Notion block structures and legacy imported posts with a different nesting format. That decision did not begin as an invariant. It began as a session-level judgment call, became a playbook rule, and only then became stable maintainer truth.&lt;/p&gt;

&lt;p&gt;That path matters.&lt;/p&gt;

&lt;p&gt;MICA does not automate the step from “we discussed this and made a call” to “this is now a governed constraint.” It provides the place to put the result. The operator decides what rises to the level of an invariant, what remains a lesson in the playbook, and what disappears when the session ends.&lt;/p&gt;

&lt;p&gt;That boundary — what gets governed, what gets remembered, what gets lost — is not a gap in MICA. It is a design decision made with every archive update.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. What Part 7 Will Address
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhw3bmshi00jb3efpmd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmjhw3bmshi00jb3efpmd.png" alt="preview part7" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 6 showed what MICA looks like in operation inside a single maintenance agent. The structure holds. The protocol runs. The session report is predictable.&lt;/p&gt;

&lt;p&gt;But that is still the easier case.&lt;/p&gt;

&lt;p&gt;The project being governed was a site: a relatively stable artifact, maintained by one operator, where the main problem was making sure the model did not forget what already mattered.&lt;/p&gt;

&lt;p&gt;Part 7 moves into a harder setting.&lt;/p&gt;

&lt;p&gt;The governed project is now a tool that runs inside AI workflows itself. That changes the governance problem. The issue is no longer only session amnesia. It is iterative accumulation: what the system learns across cycles, what becomes authoritative, what remains provisional, and what must be carried forward without allowing drift to harden into false memory.&lt;/p&gt;

&lt;p&gt;Part 7 is that case.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; A session report is not a polite summary. It is a hard gate. The model must declare — in the exact format dictated by the archive — what it loaded, what tests passed, and what drift it detected, before it is allowed to touch the repository.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.2.0 standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>contextengineering</category>
      <category>governance</category>
    </item>
    <item>
      <title>How Auditing 10 Bio-AI Repositories Shaped STEM-AI</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Mon, 30 Mar 2026 12:07:20 +0000</pubDate>
      <link>https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5</link>
      <guid>https://dev.to/flamehaven01/how-auditing-10-bio-ai-repositories-shaped-stem-ai-41b5</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbn484a53idqb1b8rm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktbn484a53idqb1b8rm8.png" alt="Cover image" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Reading path:&lt;/strong&gt;&lt;br&gt;
This post is part of a series.&lt;br&gt;
(1) &lt;a href="https://dev.to/flamehaven01/medical-ai-repositories-need-more-than-benchmarks-we-built-stem-ai-to-audit-trust-194f"&gt;STEM-AI introduction&lt;/a&gt; — what the framework is and why we built it&lt;br&gt;
(2) &lt;a href="https://flamehaven.space/writing/bio-ai-repository-audit-2026-a-technical-report-on-10-open-source-systems/" rel="noopener noreferrer"&gt;Technical audit report&lt;/a&gt; — full findings across 10 repositories&lt;br&gt;
(3) &lt;a href="https://flamehaven.space/writing/i-audited-10-open-source-bio-ai-repos-most-could-produce-outputs-few-could-establish-trust/" rel="noopener noreferrer"&gt;Narrative summary&lt;/a&gt; — what those findings actually mean&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Text Could See — and What Code Revealed&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favblcoe473uege82hbwl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Favblcoe473uege82hbwl.png" alt="what text could see" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In March 2026, we ran STEM-AI against 10 high-visibility open-source bio/medical AI repositories.&lt;/p&gt;

&lt;p&gt;The framework did what it was designed to do. It surfaced missing disclaimers, absent CI, weak reproducibility signals, and public-facing governance gaps. Those findings mattered, and the scores were directionally right.&lt;/p&gt;

&lt;p&gt;But when we reviewed the audits more carefully, one pattern kept appearing: some of the most consequential failures were not visible in the artifact surface at all. They only became obvious when we looked directly at the code.&lt;/p&gt;

&lt;p&gt;This function lives inside a repository presented as an AI-driven drug discovery workflow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_analogues(self, seed_smiles: str, count: int = 3):
    """
    Mocks a generative model (like REINVENT).
    In a real app, this would call a PyTorch model.
    """
    # Simple string manipulation for demo purposes
    analogues = []
    for i in range(count):
        if "C" in seed_smiles:
            new_smi = seed_smiles.replace("C", "C(C)", 1) if i == 0 else seed_smiles + "F"
            analogues.append(new_smi)
        else:
            analogues.append(seed_smiles + "C")
    return analogues
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It does not generate molecules. It appends characters.&lt;/p&gt;

&lt;p&gt;SMILES (Simplified Molecular Input Line Entry System) is a strict notation for molecular structure. A valid SMILES string encodes real atoms, bonds, and connectivity. Appending &lt;code&gt;C&lt;/code&gt; or &lt;code&gt;F&lt;/code&gt; yields strings that may still parse as SMILES, but they are arbitrary string edits, not generated analogues of the seed. The function runs without error, returns a list, and the pipeline continues downstream.&lt;/p&gt;
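&lt;p&gt;Calling the mock directly makes the point concrete. This is the function from the repository, condensed; the seed value is an arbitrary example:&lt;/p&gt;

```python
# Running the mock shows it is string surgery, not generation: the second
# and third "analogues" are identical, and none come from a model.
def generate_analogues(seed_smiles, count=3):
    analogues = []
    for i in range(count):
        if "C" in seed_smiles:
            new_smi = (seed_smiles.replace("C", "C(C)", 1) if i == 0
                       else seed_smiles + "F")
            analogues.append(new_smi)
        else:
            analogues.append(seed_smiles + "C")
    return analogues

out = generate_analogues("CCO")   # ethanol as the seed
print(out)                        # ['C(C)CO', 'CCOF', 'CCOF']
```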

&lt;p&gt;Our framework scored this repository T0. Correctly. But not because it saw this function.&lt;/p&gt;

&lt;p&gt;It scored T0 because the README was missing disclaimers. The CI was absent. Reproducibility was undocumented. Text-path evaluation is designed to measure exactly that. It did.&lt;/p&gt;

&lt;p&gt;The audit result was correct. The evidence surface had room to go deeper.&lt;/p&gt;

&lt;p&gt;Running the audits showed us what code-path evaluation could add on top.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What Code Access Makes Visible&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhefzdhoiz8nakmrf3959.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhefzdhoiz8nakmrf3959.png" alt="What Code Access Makes Visible" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The drug discovery example was not unusual.&lt;/p&gt;

&lt;p&gt;CellAgent's pipeline ends with this call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# main.py, line 153
&lt;/span&gt;&lt;span class="n"&gt;final_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;planner&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_final_result&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The method exists. Its body is &lt;code&gt;pass&lt;/code&gt;. The pipeline completes without error and produces nothing. A text audit reading the README would have no way to know this.&lt;/p&gt;
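&lt;p&gt;This class of failure is statically detectable. A sketch using Python's standard &lt;code&gt;ast&lt;/code&gt; module: flag any function whose entire body is &lt;code&gt;pass&lt;/code&gt;, optionally after a docstring. The sample source below paraphrases the CellAgent shape; it is not the actual file:&lt;/p&gt;

```python
# Sketch: find stub functions whose body is only `pass` (after an optional
# docstring). Such a method completes without error and produces nothing.
import ast

def find_stub_functions(source):
    stubs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            body = node.body
            # skip a leading docstring if present
            if body and isinstance(body[0], ast.Expr) and isinstance(
                    body[0].value, ast.Constant):
                body = body[1:]
            if all(isinstance(stmt, ast.Pass) for stmt in body):
                stubs.append(node.name)
    return stubs

SAMPLE = '''
class Planner:
    def generate_final_result(self, results):
        pass
'''
print(find_stub_functions(SAMPLE))   # ['generate_final_result']
```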

&lt;p&gt;BioAgents includes a rate limiter for external API calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// rateLimiter.ts, lines 62-68&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;USE_JOB_QUEUE&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rateLimiter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;consume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Job queue disabled - skip rate limiting for direct calls&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;USE_JOB_QUEUE&lt;/code&gt; defaults to &lt;code&gt;false&lt;/code&gt; in &lt;code&gt;.env.example&lt;/code&gt;. Every default deployment has rate limiting disabled. The function name implies protection. In default operation, there is none.&lt;/p&gt;
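&lt;p&gt;The fail-open default is reproducible in two lines. The variable name is from the repository; the surrounding helper is a sketch:&lt;/p&gt;

```python
# Unless the operator explicitly opts in, the guard never runs:
# .env.example ships USE_JOB_QUEUE=false, so the default path skips it.
def rate_limiting_active(env):
    return env.get("USE_JOB_QUEUE", "false") == "true"

assert not rate_limiting_active({})                      # default deployment
assert rate_limiting_active({"USE_JOB_QUEUE": "true"})   # explicit opt-in
```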

&lt;p&gt;The pattern across all three:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The code looks governed.&lt;/li&gt;
&lt;li&gt;The behavior tells a different story.&lt;/li&gt;
&lt;li&gt;That story is only visible when you read the code.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Text scores and code behavior can diverge. Knowing where and how they diverge is the next layer of evidence worth capturing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Four Directions the Audits Opened
&lt;/h2&gt;

&lt;p&gt;Reviewing all ten audits, we identified four areas where code-path evaluation could extend what text auditing already does well.&lt;/p&gt;

&lt;h3&gt;
  
  
  Direction 1: Clinical exposure is visible in imports, not just in README text.
&lt;/h3&gt;

&lt;p&gt;A repository importing pharmacogenomics allele tables has clinical exposure regardless of what its README says. Detecting that dependency at the import level — rather than waiting for a disclaimer — lets the framework flag exposure earlier. &lt;/p&gt;

&lt;p&gt;The key distinction is severity: a direct pharmacogenomics import (&lt;code&gt;CYP2D6&lt;/code&gt;, &lt;code&gt;CPIC&lt;/code&gt;) signals live patient-facing risk and is classified CA-DIRECT. &lt;/p&gt;

&lt;p&gt;A general-purpose medical imaging library like &lt;code&gt;pydicom&lt;/code&gt; or MONAI is classified CA-INDIRECT — research-use exposure, not necessarily a live clinical output path. The import alone does not determine clinical risk; the classification tier does.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 2: Not all clinical proximity is the same.
&lt;/h3&gt;

&lt;p&gt;A live pharmacogenomics dosage tool and a README roadmap note about a future ClinVar integration are not equivalent risks. Differentiating them — live output vs. research context vs. planned feature — makes the evaluation more precise and makes the accountability expectations more appropriate.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 3: Scoring stability is worth measuring directly.
&lt;/h3&gt;

&lt;p&gt;We ran Stage 1 on one repository in multiple passes. The results ranged across 28 points on the same input. Overlapping trigger conditions between hype-detection items are one contributing factor. &lt;/p&gt;

&lt;p&gt;LLM runtime stochasticity is another — the exact split between the two is still under measurement. Adding explicit discrimination examples — what exact phrasing triggers each item, what does not — makes the scoring surface cleaner and reduces the most obvious sources of variance.&lt;/p&gt;




&lt;h3&gt;
  
  
  Direction 4: Code-path behavior deserves its own scan layer.
&lt;/h3&gt;

&lt;p&gt;A fail-open pattern is a control path that appears to enforce a constraint but defaults to bypassing it. The BioAgents rate limiter above is the example. In a clinical output path, a silent pass-through is not graceful degradation. It is an untraced result that looks like a real one. Building a dedicated scan for these patterns adds a check that text auditing was never meant to provide.&lt;/p&gt;




&lt;p&gt;These four directions came directly from running the audits. The scores across the 10 repositories remain as published. Code-path evaluation is what the framework can now add on top of them.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What v1.0.6 Added — Carried Forward in v1.1.2&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;These changes were introduced in v1.0.6 and are carried forward in the current internal v1.1.2 package. They extend the framework's evidence surface into code-level behavior. Calibration is ongoing.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Two Evidence Paths, Not One
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkol0yxx1ht5r0svy5yk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkol0yxx1ht5r0svy5yk.png" alt="Two Evidence Paths, Not One" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We narrowed one of the biggest divergence points by splitting evaluation into a text path and a code path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;text path&lt;/strong&gt; works as before: read the README, CHANGELOG, and public posts, score against the rubric. Always available regardless of access to the repository.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;code path&lt;/strong&gt; activates when the audit has a local clone. It runs through Claude Code, Codex CLI, Gemini CLI, or Copilot CLI. Claims are not interpreted. They are measured. A README that says "IRB-approved data" earns no points for the statement. Points require a provenance artifact in the code.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When code confirms the README, that is a positive signal. When it contradicts it, that contradiction is the finding.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. Clinical Dependency Detection
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbakbfbq74n5vp5d2rig.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqbakbfbq74n5vp5d2rig.png" alt="Clinical Dependency Detection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the start of every local audit, a scan script reads Python imports and README keywords. It classifies the result into one of three severity levels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# ca_detection_scan.sh -- pharmacogenomics section&lt;/span&gt;

&lt;span class="c"&gt;# CA-DIRECT: live patient-facing output risk&lt;/span&gt;
check_import &lt;span class="s2"&gt;"CPIC|cpic|PharmGx|pharmacogenomic"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Pharmacogenomics (CPIC/PharmGx)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-DIRECT"&lt;/span&gt;

check_import &lt;span class="s2"&gt;"DPYD|CYP2D6|CYP2C19|allele"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"Pharmacogene alleles"&lt;/span&gt; &lt;span class="s2"&gt;"CA-DIRECT"&lt;/span&gt;

&lt;span class="c"&gt;# CA-INDIRECT: research-use clinical exposure&lt;/span&gt;
check_import &lt;span class="s2"&gt;"import pydicom|from pydicom"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"pydicom (DICOM imaging)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;

check_import &lt;span class="s2"&gt;"import monai|from monai"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"MONAI (medical AI)"&lt;/span&gt; &lt;span class="s2"&gt;"CA-INDIRECT"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Accountability requirements follow the actual clinical proximity of the code. Not the aspirational proximity of the roadmap. A roadmap mention without active implementation is treated as CA-PLANNED rather than collapsed into the same bucket as live clinical output. &lt;/p&gt;

&lt;p&gt;The pattern matching is against import statements and function names, not comment text. False positive calibration is still in progress.&lt;/p&gt;
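&lt;p&gt;A Python rendering of the same classification, with the severity precedence made explicit: a direct pharmacogenomics hit outranks an indirect imaging hit. The patterns are condensed from the shell script above; the function shape is illustrative:&lt;/p&gt;

```python
# Sketch: classify a source file's clinical exposure from import-level
# patterns. CA-DIRECT takes precedence over CA-INDIRECT when both match.
import re

CA_PATTERNS = [
    ("CA-DIRECT", re.compile(r"CPIC|cpic|PharmGx|pharmacogenomic")),
    ("CA-DIRECT", re.compile(r"DPYD|CYP2D6|CYP2C19|allele")),
    ("CA-INDIRECT", re.compile(r"import pydicom|from pydicom")),
    ("CA-INDIRECT", re.compile(r"import monai|from monai")),
]

def classify_clinical_exposure(source_text):
    hits = {level for level, pattern in CA_PATTERNS
            if pattern.search(source_text)}
    if "CA-DIRECT" in hits:
        return "CA-DIRECT"
    if "CA-INDIRECT" in hits:
        return "CA-INDIRECT"
    return "CA-NONE"

print(classify_clinical_exposure("from pydicom import dcmread"))  # CA-INDIRECT
print(classify_clinical_exposure("table = load_cpic_alleles()"))  # CA-DIRECT
```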




&lt;h3&gt;
  
  
  3. Code Integrity Scanning (C1-C4)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69g7synh94iizuciqpy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd69g7synh94iizuciqpy.png" alt="Code Integrity Scanning (C1-C4)" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A second scan handles four code-level checks: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hardcoded credentials (C1)&lt;/li&gt;
&lt;li&gt;unpinned dependencies (C2)&lt;/li&gt;
&lt;li&gt;clinical-path stubs (C3)&lt;/li&gt;
&lt;li&gt;fail-open exception handlers (C4).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The C4 check targets the BioAgents-style pattern. Searching for clinical keywords on the &lt;code&gt;except:&lt;/code&gt; line misses most real cases. Clinical context lives in function names and surrounding code. The scan uses a two-pass approach: first identify files with clinical-domain context, then find silent exception handlers within those files.&lt;/p&gt;

&lt;p&gt;A silent &lt;code&gt;except: pass&lt;/code&gt; in a clinical-context file is a trust-surface failure. The scan makes it visible without requiring a reviewer to read every exception block manually.&lt;/p&gt;
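&lt;p&gt;The two-pass structure can be sketched as follows. The keyword list and the handler pattern are illustrative assumptions, not the production scan:&lt;/p&gt;

```python
# Sketch of the two-pass C4 scan: pass one flags files with clinical-domain
# context, pass two finds silent exception handlers inside only those files.
import re

CLINICAL_CONTEXT = re.compile(r"dos(e|age)|allele|diagnos|patient", re.I)
SILENT_HANDLER = re.compile(r"except\s*(\w+\s*)?:\s*\n\s*pass\b")

def scan_fail_open(files):
    """files maps path to source text; returns paths with silent handlers."""
    clinical = [p for p, src in files.items() if CLINICAL_CONTEXT.search(src)]
    return [p for p in clinical if SILENT_HANDLER.search(files[p])]

SRC = (
    "def adjust_dosage(x):\n"
    "    try:\n"
    "        check(x)\n"
    "    except ValueError:\n"
    "        pass\n"
)
print(scan_fail_open({"dose.py": SRC, "utils.py": "x = 1\n"}))  # ['dose.py']
```

&lt;p&gt;Restricting pass two to clinically contextual files is what keeps the check from drowning in the ordinary silent handlers that any large codebase contains.&lt;/p&gt;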




&lt;h3&gt;
  
  
  4. Discrimination Examples
&lt;/h3&gt;

&lt;p&gt;To reduce the 28-point variance, we added explicit examples for each hype-detection item: what exact phrasing triggers it, what does not, and what the documented edge cases are.&lt;/p&gt;

&lt;p&gt;The goal is to reduce obvious scoring drift enough that the same repository is no longer interpreted as two different trust surfaces across different auditors or different LLMs. That goal is not yet verified. The discrimination examples are the primary mechanism toward it.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Three Questions Now Have a Fourth&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;In the first post, we described three questions the framework was built around:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Did the repository describe its limits honestly?&lt;/li&gt;
&lt;li&gt;Did public communication remain consistent with those limits?&lt;/li&gt;
&lt;li&gt;Did the codebase show evidence of maintenance and biological responsibility?&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Running 10 real audits pointed toward a fourth:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;4. Does the code actually do what the documentation says — and where it diverges, is that divergence visible and traceable?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That fourth question is what the audit outputs kept surfacing. A function name that sounds real. A pipeline that looks complete. An output that is plausible. An implementation that is a stub, or a control path that silently bypasses its own constraint.&lt;/p&gt;

&lt;p&gt;The first three questions can be answered by reading. The fourth requires looking at the code.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What the Framework Added — and What Stays the Same&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first post ended with this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"STEM-AI is meant to support serious review, not replace it."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That has not changed. Every report carries a non-removable disclaimer: LLM-generated audit, not a regulatory determination, not clinical certification. Every report carries an expiry date. The minimum threshold for supervised pilot consideration is still T3. None of the March 2026 repositories reached it.&lt;/p&gt;

&lt;p&gt;What the audits added is narrower in scope: broader evidence coverage on top of an already-working foundation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yoxv47netm2tmf881xg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yoxv47netm2tmf881xg.png" alt="What the Framework Added — and What Stays the Same" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A verifiable artifact shifts the accountability surface — it does not eliminate the possibility of falsification. The framework treats its presence as a necessary condition, not a sufficient one.&lt;/p&gt;

&lt;p&gt;What the framework gained is that the evidence it counts now extends beyond what authors say about their own code.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Three Directions Still Ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvbbl6bgo1z79mrg7wgr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flvbbl6bgo1z79mrg7wgr.png" alt="Three Directions Still Ahead" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automated re-audit on repository changes.&lt;/strong&gt; A score from three months ago may not describe the same repository. The trajectory signal measures issue close rate and release frequency across consecutive 90-day windows. It is a partial answer. A CI-triggered re-audit path is the logical next step.&lt;/p&gt;
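&lt;p&gt;The windowing behind the trajectory signal can be sketched as a single ratio. This simplified version counts all activity events together, whereas the framework tracks issue close rate and release frequency separately:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def trajectory(events, now, window_days=90):
    """Activity ratio across two consecutive 90-day windows.

    events: datetimes of issue closes, releases, and similar signals.
    Returns a ratio: 1.0 is steady, above 1 rising, below 1 decaying.
    Simplified sketch; the framework's real signal separates issue
    close rate from release frequency.
    """
    recent_start = now - timedelta(days=window_days)
    prior_start = now - timedelta(days=2 * window_days)
    recent = sum(1 for e in events if e >= recent_start)
    prior = sum(1 for e in events if e >= prior_start and recent_start > e)
    return recent / prior if prior else float(recent)
```

A CI-triggered re-audit would recompute this on every push, rather than waiting for the next manual audit cycle.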

&lt;p&gt;&lt;strong&gt;The denominator problem.&lt;/strong&gt; Zero of 10 repositories reached T3. This may accurately describe the ecosystem's current state. It may also reflect calibration issues in the upper tiers. Distinguishing between the two requires before-and-after auditing of repositories that have received systematic governance remediation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Stage 2 redistribution question.&lt;/strong&gt; Most audits have no cross-platform consistency data. When that data is unavailable, the framework redistributes Stage 2's weight equally between documentation quality and engineering accountability. &lt;/p&gt;

&lt;p&gt;For repositories with clinical-direct exposure, a well-written README can then compensate for weak code accountability. A guardrail flags this condition. The current redistribution rule is explicit but not yet final — it remains one of the framework's open calibration questions.&lt;/p&gt;
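&lt;p&gt;A sketch of the redistribution rule and its guardrail, with illustrative weights and a made-up flag threshold rather than the framework's published values:&lt;/p&gt;

```python
def stage2_weights(has_consistency_data, stage2_weight=0.30):
    """Split Stage 2's weight when cross-platform data is unavailable.

    stage2_weight is illustrative, not the framework's published value.
    """
    if has_consistency_data:
        return {"consistency": stage2_weight}
    half = stage2_weight / 2
    return {"documentation": half, "engineering": half}

def guardrail(exposure, doc_score, code_score):
    """Flag the condition described above: clinical-direct exposure where
    a strong README masks weak engineering accountability. The 0.3 gap
    threshold is a placeholder, not a calibrated value."""
    return exposure == "clinical-direct" and doc_score - code_score > 0.3
```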




&lt;blockquote&gt;
&lt;p&gt;If there are open-source bio-AI repositories you think should be audited next, drop them in the comments. Bonus if they claim clinical relevance, drug discovery, or medical reasoning.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;STEM-AI v1.1.2 — Trust Evaluation Framework for Medical AI Repositories.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;"Code works. But does the author care about the patient?"&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>medicalai</category>
      <category>aigovernance</category>
      <category>bioinformatics</category>
      <category>healthtech</category>
    </item>
    <item>
      <title>Everyone Was Talking About Context Engineering. Nobody Had Solved Governance.</title>
      <dc:creator>Kwansub Yun</dc:creator>
      <pubDate>Wed, 25 Mar 2026 09:29:42 +0000</pubDate>
      <link>https://dev.to/flamehaven01/everyone-was-talking-about-context-engineering-nobody-had-solved-governance-424j</link>
      <guid>https://dev.to/flamehaven01/everyone-was-talking-about-context-engineering-nobody-had-solved-governance-424j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Disclosure: This article was written by the author with AI assistance for editing. All technical content, architecture decisions, and design rationale are the author's own. #ABotWroteThis&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Glossary: terms used in this article
&lt;/h2&gt;

&lt;p&gt;🔸 &lt;strong&gt;MICA (Memory Invocation &amp;amp; Context Archive)&lt;/strong&gt;: A governance schema for AI context management. Defines how context should be structured, trusted, scored, and handed off across sessions.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Fail-Closed Gate&lt;/strong&gt;: An admission rule that excludes a context item if it fails a required threshold — regardless of its score on other dimensions. No exceptions. Introduced in v0.1.7.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;README-as-Protocol&lt;/strong&gt;: The pattern in which an AI session's natural behavior of reading the README first is formalized as the primary invocation mechanism. No installation required. Introduced in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Invocation Protocol&lt;/strong&gt;: The schema-level declaration of how a MICA archive reaches an AI session — and how the session confirms it was loaded. Formalized as a required field in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Session Report Format&lt;/strong&gt;: The structured opening report the model must produce at session start to confirm the archive was loaded. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Design Invariant Entry&lt;/strong&gt;: A structured governance rule with identity, rule text, and severity. Replaced plain string invariants in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Self-Test Policy&lt;/strong&gt;: Machine-evaluable checks that validate the archive against the real project state — file existence, hash integrity, and README sync. Required in v0.1.8.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Playbook&lt;/strong&gt;: The operator-facing discipline layer that sits outside the schema. The schema enforces structure; the Playbook enforces judgment.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;Context Engineering&lt;/strong&gt;: The practice of shaping what the model sees, in what order, with what boundaries, and under what assumptions — not just what you ask it, but what it actually knows at runtime.&lt;/p&gt;

&lt;p&gt;🔸 &lt;strong&gt;CTX&lt;/strong&gt;: A collection-first context packaging approach that gathers relevant workspace material and delivers it to the model. In this article, CTX represents the collection layer of context engineering — answering, “What does the AI see?”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqx6707g8j9hu59d7wiu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqx6707g8j9hu59d7wiu.png" alt="Trustworthy context engineering" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  1. What Parts 1 Through 4 Actually Established
&lt;/h2&gt;

&lt;p&gt;The first four parts of this series already narrowed the problem considerably.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/my-llm-kept-forgetting-my-project-so-i-built-a-governance-schema-4bo6"&gt;Part 1&lt;/a&gt; defined the failure mode.&lt;/strong&gt;&lt;br&gt;
The issue was not that long-running AI work needed “more prompt.” The issue was that a model can only act on what it actually knows right now, and most project context systems still treat that as a document problem instead of a governance problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-schema-existed-the-model-had-no-way-to-know-3626"&gt;Part 2&lt;/a&gt; established the first hard boundary.&lt;/strong&gt;&lt;br&gt;
A schema can exist, and the model can still have no reliable way to know it exists. That is the difference between a document and a constraint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-stake-was-governance-outside-the-schema-mica-v015-pulled-it-in-46n9"&gt;Part 3&lt;/a&gt; moved governance into the schema.&lt;/strong&gt;&lt;br&gt;
Provenance, deviations, and semantic rules could no longer remain outside the system in READMEs, comments, or team habits. They had to become machine-readable structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://dev.to/flamehaven01/the-model-already-read-the-readme-mica-v018-made-it-a-protocol-37j9"&gt;Part 4&lt;/a&gt; made that structure operative.&lt;/strong&gt;&lt;br&gt;
The model already treated the README as its natural entry surface. Once that behavior was declared as an invocation protocol, the schema stopped being a passive archive and became a runtime contract.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That progression defines what MICA is actually trying to solve.&lt;/p&gt;

&lt;p&gt;It is not trying to be “better prompting.”&lt;br&gt;&lt;br&gt;
It is not trying to be “more retrieval.”&lt;br&gt;&lt;br&gt;
It is trying to answer a narrower question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How does governed context reach the model, under declared rules, with confirmable load and auditable change?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is the bridge into the broader landscape.&lt;/p&gt;


&lt;h2&gt;
  
  
  2. Context Engineering Was Never Just Prompting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1mx3x0xenx6pc80hd16.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg1mx3x0xenx6pc80hd16.png" alt="The structural gap in context engineering" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One clarification matters before going further.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context engineering&lt;/strong&gt; is not just prompt writing.&lt;/p&gt;

&lt;p&gt;At the broadest level, it is the practice of shaping what the model sees, in what order, with what boundaries, and under what assumptions. Prompts are one part of that. Retrieval is another. File selection, memory handoff, system instructions, and workspace state all belong to the same larger question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the model actually know right now?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;By that definition, MICA is part of context engineering.&lt;/p&gt;

&lt;p&gt;Not because it retrieves context.&lt;br&gt;&lt;br&gt;
Not because it packs more tokens into a window.&lt;br&gt;&lt;br&gt;
But because it governs which context is allowed to shape the session, under what trust conditions, and with what record when those conditions are tested.&lt;/p&gt;

&lt;p&gt;That distinction matters, because most of the field has focused on a different layer.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. The Conversation Was Already Happening
&lt;/h2&gt;

&lt;p&gt;Context engineering is not a new idea, and MICA does not claim to have invented the conversation.&lt;/p&gt;

&lt;p&gt;The term was amplified by Andrej Karpathy and others, but the underlying practice — designing what the model sees, not just what you ask it — had already been emerging in serious AI work.&lt;/p&gt;

&lt;p&gt;Collection-first tools already existed. CTX is a useful example of that layer: it gathers relevant workspace material and delivers it to the model without manual copy-paste. It answers an important question well:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the same time, some of the sharper practitioner writing was already moving beyond collection alone. One such example was an OpenAI Developer Community post by Serge Liatko, &lt;a href="https://community.openai.com/t/prompt-engineering-is-dead-and-context-engineering-is-already-obsolete-why-the-future-is-automated-workflow-architecture-with-llms/1314011" rel="noopener noreferrer"&gt;&lt;strong&gt;“Prompt Engineering Is Dead, and Context Engineering Is Already Obsolete: Why the Future Is Automated Workflow Architecture with LLMs.”&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The value of that piece was not the author's status, but the precision of the problem it named: manually maintained context eventually reaches its ceiling. &lt;/p&gt;

&lt;p&gt;Once system state changes faster than humans can keep context aligned, the real question is no longer just how to collect context, but how to automate its ownership, maintenance, and validation as the system evolves.&lt;/p&gt;

&lt;p&gt;That was an important move forward.&lt;/p&gt;

&lt;p&gt;But one layer still remained underdefined.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. The Missing Layer Was Already Visible
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdr13fr8q5dff4jqhini.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpdr13fr8q5dff4jqhini.png" alt="Gathering data vs governing data" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
The same missing layer had already shown up elsewhere.&lt;/p&gt;

&lt;p&gt;In &lt;a href="https://dev.to/flamehaven01/your-agentic-stack-has-two-layers-it-needs-three-3h1"&gt;&lt;strong&gt;Your Agentic Stack Has Two Layers. It Needs Three&lt;/strong&gt;&lt;/a&gt;, I argued that the usual stack had matured around two strong layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP / tool calls — &lt;strong&gt;how&lt;/strong&gt; the agent talks to systems&lt;/li&gt;
&lt;li&gt;agent skills — &lt;strong&gt;what&lt;/strong&gt; the agent can do&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But something was still missing above both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the layer that decides &lt;strong&gt;whether&lt;/strong&gt; the agent should do it, under what constraints, and toward what end&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the layer of intent, authority, and governance.&lt;/p&gt;

&lt;p&gt;The same problem appears in context engineering.&lt;/p&gt;

&lt;p&gt;A context pipeline can be excellent at retrieval and still be weak at governance. It can gather the right files, summarize the right notes, and deliver the right-looking material — and still fail to answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what is authoritative?&lt;/li&gt;
&lt;li&gt;what is provisional?&lt;/li&gt;
&lt;li&gt;what must never be violated?&lt;/li&gt;
&lt;li&gt;what changed since last time?&lt;/li&gt;
&lt;li&gt;how does the session prove it loaded the governed archive at all?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not collection questions.&lt;/p&gt;

&lt;p&gt;They are governance questions.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Where CTX Stops and MICA Begins
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye2vmym08cd61jq7vn62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fye2vmym08cd61jq7vn62.png" alt="Inside context engineering, beyond collection" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;CTX solves the collection problem.&lt;/p&gt;

&lt;p&gt;It answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a necessary layer. Without it, context management collapses back into manual copy-paste, repeated explanation, and fragile session startup.&lt;/p&gt;

&lt;p&gt;But it does not answer the next set of questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where did this context item come from, and can that claim be verified?&lt;/li&gt;
&lt;li&gt;What happens when a file changes between sessions?&lt;/li&gt;
&lt;li&gt;Which constraints are non-negotiable, and what is the consequence when they are violated?&lt;/li&gt;
&lt;li&gt;Who approved the last change to the archive, and can that decision be audited?&lt;/li&gt;
&lt;li&gt;When the session begins, how does the system confirm the archive was actually loaded?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those are not retrieval questions.&lt;/p&gt;

&lt;p&gt;They are governance questions.&lt;/p&gt;

&lt;p&gt;That is the narrow but consequential difference.&lt;/p&gt;

&lt;p&gt;CTX collects context and delivers it.&lt;br&gt;&lt;br&gt;
MICA governs what trust that context carries, what invariants it must not violate, what happens when it changes, and how the session proves that the governed archive was actually loaded.&lt;/p&gt;

&lt;p&gt;One answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What does the AI see?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The other answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Under what rules does the AI operate — and what is the record when those rules are tested?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They are different layers. Neither replaces the other.&lt;/p&gt;


&lt;h2&gt;
  
  
  6. The Part That Was Still Open
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufv6hwpt82ohirmltadz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufv6hwpt82ohirmltadz.png" alt="The four gates of the governance layer" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A recurring theme in serious context-engineering discussion is that context cannot remain a hand-curated artifact forever. It has to become a function of system state.&lt;/p&gt;

&lt;p&gt;But that still leaves one question open:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Who owns the specification for each step's input — and how is this versioned, tested, and audited as requirements shift?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;MICA is a concrete, working answer to that gap.&lt;/p&gt;

&lt;p&gt;Not the only possible answer. Not necessarily the final one. But a real one.&lt;/p&gt;

&lt;p&gt;Its claim is not that context engineering needed to be invented.&lt;/p&gt;

&lt;p&gt;Its claim is that context engineering still needed a governance layer with at least four properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;machine-addressable invariants&lt;/li&gt;
&lt;li&gt;versioned and auditable change records&lt;/li&gt;
&lt;li&gt;self-tests against the real project state&lt;/li&gt;
&lt;li&gt;declared invocation with confirmable session load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the layer MICA was built to supply.&lt;/p&gt;


&lt;h2&gt;
  
  
  7. What Governance Actually Means Here
&lt;/h2&gt;

&lt;p&gt;Governance is an overloaded word. In this context it means something specific.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Provenance&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every context item must declare where it came from in a way that can be checked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auditability&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Changes to the archive are recorded when they happen, not reconstructed later from memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invariant enforcement&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Constraints are not vague README prose. They are structured entries with identity and severity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-testing&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The archive is checked against the real project state, not only against its own internal shape.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invocation confirmation&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The model does not silently ignore the archive. Session start requires a structured acknowledgment that the governed archive was loaded.&lt;/p&gt;

&lt;p&gt;None of these are abstract principles in MICA.&lt;br&gt;&lt;br&gt;
They are structural requirements in a running schema.&lt;/p&gt;
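&lt;p&gt;Two of these requirements, invariant enforcement and self-testing, can be sketched concretely. The field names below are illustrative and may not match MICA's actual schema:&lt;/p&gt;

```python
import hashlib
from pathlib import Path

# A Design Invariant Entry: identity, rule text, severity -- structure,
# not README prose. Field names are illustrative.
INVARIANTS = [
    {"id": "INV-001",
     "rule": "No context item admitted without verifiable provenance",
     "severity": "critical"},
]

def self_test(archive_files):
    """Check the archive against the real project state: does each
    referenced file still exist, and does its hash still match?"""
    failures = []
    for entry in archive_files:
        p = Path(entry["path"])
        if not p.exists():
            failures.append((entry["path"], "missing"))
            continue
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if digest != entry["sha256"]:
            failures.append((entry["path"], "hash drift"))
    return failures
```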


&lt;h2&gt;
  
  
  8. What This Is Not
&lt;/h2&gt;

&lt;p&gt;It is worth being precise about the boundary.&lt;/p&gt;

&lt;p&gt;MICA does &lt;strong&gt;not&lt;/strong&gt; generate context automatically from the codebase. That is not its job. Collection-first systems already exist, and they are valuable. MICA governs what happens once context has been identified.&lt;/p&gt;

&lt;p&gt;MICA does &lt;strong&gt;not&lt;/strong&gt; replace human judgment. A schema can require structure, audit trail, drift response, and self-tests. It cannot eliminate operator discipline. That is why the boundary between schema and playbook matters.&lt;/p&gt;

&lt;p&gt;MICA is also &lt;strong&gt;not&lt;/strong&gt; a finished system. Parts 1 through 4 of this series were explicit about what each version got wrong, what each version corrected, and what remained unresolved.&lt;/p&gt;

&lt;p&gt;That design history is part of the claim, not an embarrassment to it.&lt;/p&gt;


&lt;h2&gt;
  
  
  9. The Actual Claim
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2byeodafgh0grihwdak.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs2byeodafgh0grihwdak.png" alt="Operative Governance" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The actual claim is not that MICA solves all of context engineering.&lt;/p&gt;

&lt;p&gt;It is this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A small operation can already run a governed AI context system — with verifiable provenance, deviation audit trail, structured invariants, self-testing, and declared invocation — without waiting for future tooling that does not yet exist.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That claim is demonstrated by the design history already covered in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.0&lt;/strong&gt; made scoring implementable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.5&lt;/strong&gt; brought governance structure into the schema&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.7&lt;/strong&gt; made scoring fail-closed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.8&lt;/strong&gt; made invocation declared and confirmable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;v0.1.8.1&lt;/strong&gt; clarified the remaining runtime ambiguities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And in practice, governance at runtime can look as concrete as this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[SESSION READY]
Gate: PASS (self-tests: 7/7) | Track: A,B
Critical Invariants: 3/3 | Deviations: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a critical check fails, the session does not proceed. That is what governance looks like when it becomes operative.&lt;/p&gt;
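&lt;p&gt;A minimal sketch of that fail-closed gate, producing a report shaped like the session block shown earlier. The field names and report format are assumptions, not MICA's published schema:&lt;/p&gt;

```python
def session_gate(self_tests, critical_invariants):
    """Fail-closed admission: any failed self-test or failed critical
    invariant blocks the session, regardless of other scores.
    Field names are illustrative, not MICA's published schema."""
    passed = sum(1 for t in self_tests if t["ok"])
    critical_ok = all(i["ok"] for i in critical_invariants
                      if i["severity"] == "critical")
    ready = passed == len(self_tests) and critical_ok
    header = "READY" if ready else "BLOCKED"
    gate = "PASS" if ready else "FAIL"
    report = (f"[SESSION {header}]\n"
              f"Gate: {gate} (self-tests: {passed}/{len(self_tests)})")
    return ready, report
```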

&lt;p&gt;What can be said now is narrower and more solid: the gap is real, collection-first systems solve one side of it, and MICA addresses the governance side. The conversation about what comes after context engineering was already happening; MICA is one concrete answer to that part of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. What Comes Next
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfq5h3kuzj7haxz1kbni.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkfq5h3kuzj7haxz1kbni.png" alt="From landscape to concrete operation" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Part 4 ended with a specific question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Where does MICA sit in the context engineering landscape that already existed around it?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This post is the answer.&lt;/p&gt;

&lt;p&gt;It sits inside context engineering — but not at the collection layer.&lt;/p&gt;

&lt;p&gt;It does not compete with retrieval-first systems by trying to collect more files, pack more tokens, or automate more handoff. It begins after that layer. Its job is to govern what enters the session, what remains authoritative, what drift means, what violations matter, and how the model proves it actually loaded the governed archive at all.&lt;/p&gt;

&lt;p&gt;That is why the answer is narrower than most people expect, and more specific than most framings allow.&lt;/p&gt;

&lt;p&gt;MICA is not “the future of all context engineering.”&lt;br&gt;
It is a governance answer to the part of context engineering that collection alone does not solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 6&lt;/strong&gt; will move back down from landscape to concrete operation.&lt;/p&gt;

&lt;p&gt;It will show what MICA looks like in an actual project context: what a session opening report looks like, what a deviation log entry looks like in practice, and what happens when a self-test flags drift.&lt;/p&gt;

&lt;p&gt;After that comes the harder question: what remains unresolved.&lt;/p&gt;

&lt;p&gt;The series continues only where there is something concrete to specify, test, or correct.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jw1nvooyqfi7pz0ccdk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8jw1nvooyqfi7pz0ccdk.png" alt="The named decision" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Named decision from this post:&lt;/strong&gt; Governance is not a layer you add after context engineering works. It is the layer that makes context engineering trustworthy — by declaring what is authoritative, recording what changes, and confirming what the AI actually loaded.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;MICA is part of the Flamehaven governance-first AI systems practice. Schema, technical report, and production instance: &lt;a href="https://flamehaven.space" rel="noopener noreferrer"&gt;flamehaven.space&lt;/a&gt;. Open-source tooling: &lt;a href="https://github.com/Flamehaven01/AI-SLOP-Detector" rel="noopener noreferrer"&gt;AI-SLOP-Detector&lt;/a&gt;. All schema references follow the v0.1.8.1 Universal standard unless a specific earlier version is named.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>contextengineering</category>
    </item>
  </channel>
</rss>
