Kwansub Yun

Posted on May 19 • Originally published at flamehaven.space

From Repo Scanner to Audit Architecture: What Changed in STEM BIO-AI Through v1.7.8

#opensource #governance #bioinformatics #devtools

Three technical changes that made the scanner less Python-shaped, the warning model more stable, and the reports more inspectable.

The last time I wrote about STEM BIO-AI, the focus was AIRI:

how a local repository scanner could expand its risk vocabulary without pretending to become a universal AI safety judge.

That was the right story for 1.7.0 and 1.7.1.

But the project changed meaningfully after that.

For readers who have not followed the earlier posts: Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x

By 1.7.8, the interesting question was no longer just:

Can this scanner attach a broader risk language to local findings?

It became:

Can this scanner make those findings more inspectable, less misleading, and more robust across real repository shapes?

That shift matters.

Because in audit tooling, correctness is only the first battle. The second battle is whether a reviewer can see why the tool landed where it did, and whether the output still makes sense when it leaves the terminal and becomes a report, a PDF packet, a Hugging Face demo, or a governance memo.

From 1.7.6 through 1.7.8, three changes mattered most.

They changed:

what counts as evidence,
how warning lanes are separated,
and how the final artifact stays legible across surfaces.

This is the more technical story behind those releases.

Basic AIRI (the AI Risk Repository) Context: Expanding the Language of Risk

Before getting into the release details, it helps to define what AIRI means in this series.

AIRI refers here to the MIT AI Risk Repository: a public AI risk resource from the MIT AI Risk Initiative that organizes fragmented AI risk language across research, policy, and industry sources.

The repository includes an AI Risk Database, a Causal Taxonomy of AI Risks, and a Domain Taxonomy of AI Risks. According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications, while the public domain taxonomy organizes risks into 7 domains and 24 subdomains.

That makes AIRI useful as a vocabulary source.

But vocabulary is not truth.

A local scanner should not say:

this repository caused this risk.

It should say something more careful:

this local finding belongs to a broader class of AI risk language.

That distinction is the design boundary.

1. Problem: The scanner was still too Python-shaped

One of the more useful failures in this line came from an uncomfortable result: a repository could obviously have dependency and lockfile evidence, and STEM BIO-AI could still miss it.

That is not a philosophical problem.
That is an implementation problem.

In practice, the project was still too biased toward Python-native signals.

That showed up most clearly in JavaScript or mixed-stack repositories:

package.json
package-lock.json
pnpm-lock.yaml
yarn.lock
npm-shrinkwrap.json

were not being treated as first-class provenance and replication evidence in the same way that requirements.txt or pyproject.toml were.

The result was a false-negative pattern:

Stage 3 provenance (B1) could be undercounted
Stage 4 replication evidence could be undercounted
and the report could quietly imply "no dependency evidence" when the repository clearly had dependency structure

That kind of miss is more dangerous than it sounds.

Not because it makes the score a little wrong.

But because it damages trust in the scanner's worldview.

If developers see a tool miss an obvious pnpm-lock.yaml, they stop believing the harder claims too.

What changed in `1.7.6`

The fix was straightforward but important:

JavaScript manifests and lockfiles were promoted into the same evidence families as the existing Python manifests where appropriate.

Concretely, that meant:

B1_data_provenance_controls started recognizing JS manifest/lock surfaces
S4_environment_lock_evidence started recognizing them
S4_exact_dependency_pins_or_hashes started recognizing them

This was not a scoring philosophy change.

It was a scope correction.

The rule engine learned that a dependency ecosystem is a dependency ecosystem even when it is not Python.
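A minimal sketch of that scope correction, assuming a simple file-name lookup. The file names come from this post; the table layout and the `dependency_evidence` helper are hypothetical and are not the STEM BIO-AI rule engine. Python lockfile names are left out because the post does not list any.

```python
from pathlib import Path

# Hypothetical evidence tables: file names are from the post,
# the grouping by ecosystem is illustrative only.
MANIFESTS = {
    "python": {"requirements.txt", "pyproject.toml"},
    "javascript": {"package.json"},
}
LOCKFILES = {
    "javascript": {
        "package-lock.json",
        "pnpm-lock.yaml",
        "yarn.lock",
        "npm-shrinkwrap.json",
    },
}


def dependency_evidence(file_names):
    """Group recognized manifest and lockfile names by ecosystem."""
    names = {Path(name).name for name in file_names}
    found = {}
    for kind, table in (("manifests", MANIFESTS), ("lockfiles", LOCKFILES)):
        for ecosystem, known in table.items():
            hits = sorted(names & known)
            if hits:
                found.setdefault(ecosystem, {})[kind] = hits
    return found


evidence = dependency_evidence(["package.json", "pnpm-lock.yaml", "src/index.ts"])
print(evidence)
```

The point of the shape is that a JavaScript-only repository returns a non-empty result, so a downstream rule has something to count instead of reporting that no manifest was detected.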

One boundary matters here.

B1_data_provenance_controls does not suddenly mean "dataset lineage was proven by a lockfile."

In this lane, B1 is using dependency manifests as repository provenance surfaces:

what environment the repository expects,
what dependency custody the repository exposes,
and whether the repo surfaces any adjacent data-source, IRB, or dataset-citation language around that environment.

That is weaker than dataset lineage evidence.

But it is also much stronger than pretending a mixed-stack repository has no provenance surface at all.

A small before/after that makes the point

The yorkeccak/bio case is a good example because the score movement was not philosophical. It was mechanical.

Before the JS manifest fix, the same repository could produce:

version: 1.7.5
final_score: 40
stage_3_code_bio: 6
B1_data_provenance_controls: 0 / 15
replication_score: 10
AIRI covered_count: 0 / 31

After the manifest and lockfile correction, the same repository shape produced:

version: 1.7.8
final_score: 48
stage_3_code_bio: 25
B1_data_provenance_controls: 15 / 15
replication_score: 30
AIRI covered_count: 7 / 32

The important part is not the score delta by itself. It is where the delta came from.

One small boundary is worth making explicit here.

The AIRI change is doing two things at once:

the denominator moved from 31 to 32 because the governed AIRI detector-scope expanded by one mapping row across this release line,
and the numerator moved from 0 to 7 because the current release can now carry more bounded AIRI links around the findings it actually surfaced.

That explains the AIRI coverage delta.
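The two movements can be kept apart with plain arithmetic. This is a small sketch using the numbers from the snapshots above; the dictionary keys and the `airi_delta` helper are hypothetical names, not fields of the real report.

```python
# Values copied from the before/after snapshots in this post.
before = {"version": "1.7.5", "airi_covered": 0, "airi_scope": 31}
after = {"version": "1.7.8", "airi_covered": 7, "airi_scope": 32}


def airi_delta(old, new):
    """Report scope growth and coverage growth as separate numbers."""
    return {
        "scope_added": new["airi_scope"] - old["airi_scope"],
        "covered_added": new["airi_covered"] - old["airi_covered"],
    }


delta = airi_delta(before, after)
print(delta)  # {'scope_added': 1, 'covered_added': 7}
```

Reporting the denominator change separately keeps a reader from treating a scope expansion as if it were a change in the repository.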

The scoring delta came from a more mechanical correction:

package.json, package-lock.json, and pnpm-lock.yaml stopped being invisible,
Stage 3 stopped saying "no dependency/provenance manifest detected,"
and Stage 4 stopped undercounting replication structure that was obviously there.

That is what I mean by "blind spot removal" rather than score drift.

Why that matters

This is the kind of change that sounds small in a changelog but large in practice.

Because it changes the relationship between the tool and the developer reading it.

A scanner earns the right to say "this repo is weak on provenance" only after it can correctly see the basic surfaces that exist in the target stack.

That correction also made later report outputs more believable.

When B1 moved from 0 to 15 in affected repositories, that was not "score drift." It was the removal of a blind spot.

And that distinction is exactly why audit tools need explicit versioned rationale.

Without it, every score movement looks arbitrary.

2. Problem: The warning lanes were doing too many jobs at once

Before the split, it helps to read C1–C6 as code-integrity lanes.

They are not general AI risk categories. They are reviewer-facing signals that tell you what kind of repository weakness the scanner found, and where to inspect next.

| Lane | What it means in STEM BIO-AI | What a reviewer should inspect |
| --- | --- | --- |
| `C1` | Hardcoded credential signals | exposed API keys, cloud keys, tokens, or credential-like patterns |
| `C2` | Dependency pinning and external-service fragility | loose dependency ranges, missing exact pins, fragile external service assumptions |
| `C3` | Deprecated patient-adjacent paths | legacy, archive, or deprecated folders that still contain patient or clinical-adjacent patterns |
| `C4` | Fail-open exception handling | `except: pass`, `except Exception: pass`, silent fallbacks, or code paths where errors can disappear |
| `C5` | Compliance and clinical-boundary integrity | unsupported HIPAA, compliance, clinical-safe, self-hosted, or regulatory-adjacent claims |
| `C6` | Mock-auth or no-auth local/self-host trust boundaries | auto-login, mock authentication, no-auth flows, or weak local trust-boundary assumptions |

That table matters because C4, C5, and C6 are not interchangeable.

A fail-open exception is not the same problem as an unsupported compliance claim.

And an unsupported compliance claim is not the same problem as a mock-auth self-host boundary.

That distinction became important once the report started surfacing more nuanced governance signals.
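To make the C4 row concrete, here is a minimal sketch of how `except: pass` and `except Exception: pass` could be found with Python's standard `ast` module. It is an illustration under that assumption, not the detector STEM BIO-AI ships; the `fail_open_handlers` name and the sample source are made up for the example.

```python
import ast

SOURCE = '''
def load(path):
    try:
        return open(path).read()
    except Exception:
        pass

def parse(text):
    try:
        return int(text)
    except ValueError:
        return None
'''


def fail_open_handlers(source):
    """Return line numbers of bare or broad handlers whose body is only `pass`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # A bare `except:` has no type; `except Exception:` is a plain name.
        broad = node.type is None or (
            isinstance(node.type, ast.Name) and node.type.id == "Exception"
        )
        only_pass = all(isinstance(stmt, ast.Pass) for stmt in node.body)
        if broad and only_pass:
            hits.append(node.lineno)
    return sorted(hits)


print(fail_open_handlers(SOURCE))  # [5]
```

The narrow `except ValueError` handler is not flagged, which is the behavior a code-path lane needs: it reports swallowed errors, not every handler.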

The old C4 lane had started life as a code-oriented fail-open/exception surface.

But as the scanner got better at spotting unsupported compliance language and boundary failures, more and more signals were being interpreted near that same lane.

That made the result harder to read.

If a reviewer sees:

C4_exception_handling_clinical_adjacent_paths: WARN

they should be able to infer the remediation class immediately.

They should know to inspect executable control flow.

They should not have to wonder whether the warning is actually about a README compliance claim, a missing clinical boundary, or a mock-auth local path.

Once one lane starts carrying all of those meanings, the ID stops doing its job.

This is a common failure mode in rule systems.

At first it feels efficient:

one warning lane,
one bucket,
multiple related issues.

Then a few releases later the bucket becomes a junk drawer.

That is exactly what had to be prevented here.

What changed in `1.7.7` and `1.7.8`

The solution was to split the lane cleanly:

C4 stayed reserved for executable fail-open exception behavior
C5 was introduced for unsupported compliance or boundary-integrity claims
C6 was introduced for mock-auth, auto-login, or no-auth self-host/local trust-boundary signals

This was more than renaming.

It made the model of the problem cleaner:

C4 is code-path failure semantics
C5 is governance/claim integrity
C6 is trust-boundary collapse in local or self-host flows

That distinction matters to developers because those are different remediation classes.

If a repository triggers C4, you inspect executable control flow.
If it triggers C5, you inspect public claim surfaces and supporting governance evidence.
If it triggers C6, you inspect local auth and trust-boundary design.

One warning label should not try to be all three.

The more interesting case is when two of those lanes fire together.

A repository can claim something like "HIPAA-ready self-hosting" at the README layer and also expose a mock-auth or auto-login local path.

That is not one problem.

It is two related problems:

C5 says the claim surface is overstating governance integrity
C6 says the local trust boundary is weaker than the claim suggests

That is exactly why the split matters.

If those two findings collapse into one bucket, the reviewer loses both remediation clarity and causal ordering.

If they stay separate, the report can say:

the public claim is weak,
the local boundary is weak,
and both together make the repository easier to over-trust.
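One way to picture the split is a lane-to-inspection lookup that never merges lanes. This is a sketch only: the lane descriptions paraphrase this post, while the `Lane` enum, the `INSPECT` table, and `review_plan` are hypothetical names.

```python
from enum import Enum


class Lane(Enum):
    C4 = "executable fail-open exception behavior"
    C5 = "unsupported compliance or boundary-integrity claims"
    C6 = "mock-auth, auto-login, or no-auth local/self-host trust boundaries"


# Where a reviewer looks first for each lane, paraphrasing the post.
INSPECT = {
    Lane.C4: "executable control flow",
    Lane.C5: "public claim surfaces and supporting governance evidence",
    Lane.C6: "local auth and trust-boundary design",
}


def review_plan(lanes):
    """One inspection target per fired lane, in stable lane order."""
    ordered = sorted(set(lanes), key=lambda lane: lane.name)
    return [f"{lane.name}: inspect {INSPECT[lane]}" for lane in ordered]


plan = review_plan([Lane.C6, Lane.C5])
for line in plan:
    print(line)
```

When C5 and C6 fire together, the plan contains two lines rather than one merged warning, so each finding keeps its own remediation class.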

The code insight

This is one of those places where good audit tooling starts looking more like good static analysis design.

A useful warning family is not just one that catches things.

It is one that stays semantically stable across releases.

That is why this split mattered:

it was not just about improving recall.

It was about preserving interpretability under growth.

Once a detector ID becomes ambiguous, your historical comparisons become weaker.

And once historical comparisons become weaker, your audit system starts losing its memory.

That is a bigger problem than one missed warning.

3. Problem: The report could still be correct and yet hard to trust

A repository scanner does not end its life in JSON.

It ends up in:

Markdown
HTML
PDF
demos
governance reviews
screenshots
and social arguments

That means the output architecture matters almost as much as the scoring logic.

And there were two places where this became obvious.

First: AIRI numbers needed explanation, not just display

Earlier versions could show AIRI coverage as a count, but not always make it obvious why a covered risk appeared.

That is a problem.

Because a number like 7 / 32 looks precise.

But precision without causal explanation is fragile.

Developers do not just want to know that a risk mapped.
They want to know:

which detector triggered it,
why that detector maps to that AIRI risk,
and what boundary still remains around that mapping.

So the AIRI layer had to become more explicit.

That is where mapping_details mattered.

Covered AIRI rows now carry bounded reasoning objects that can say, in effect:

detector ID
mapping justification
trigger reason

That is a much stronger artifact than a bare coverage count.

It turns AIRI from a visual add-on into an inspectable vocabulary layer.

In practice the object now looks more like this:

{
  "id": "24.01.03",
  "title": "Safe exploration problem with widely deployed AI assistants",
  "covered_by": ["C5_compliance_boundary_integrity"],
  "mapping_details": [
    {
      "detector_id": "C5_compliance_boundary_integrity",
      "mapping_justification": "Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts.",
      "trigger_reason": "Unsupported legal/compliance claim surfaced in boundary-integrity lane."
    }
  ]
}

That matters because the AIRI layer no longer asks the reviewer to trust a number alone.

It now gives the reviewer a bounded reasoning object to inspect.
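A reviewer, or a test, can also check that every covered row actually explains itself. The sketch below assumes the row shape shown above; the `unexplained_detectors` helper and the `REQUIRED` tuple are hypothetical and not part of the STEM BIO-AI API.

```python
import json

# The row is copied from the example above.
ROW = json.loads("""
{
  "id": "24.01.03",
  "title": "Safe exploration problem with widely deployed AI assistants",
  "covered_by": ["C5_compliance_boundary_integrity"],
  "mapping_details": [
    {
      "detector_id": "C5_compliance_boundary_integrity",
      "mapping_justification": "Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts.",
      "trigger_reason": "Unsupported legal/compliance claim surfaced in boundary-integrity lane."
    }
  ]
}
""")

REQUIRED = ("detector_id", "mapping_justification", "trigger_reason")


def unexplained_detectors(row):
    """Detectors in covered_by that lack a complete mapping_details entry."""
    explained = {
        detail["detector_id"]
        for detail in row.get("mapping_details", [])
        if all(detail.get(key) for key in REQUIRED)
    }
    return sorted(set(row.get("covered_by", [])) - explained)


print(unexplained_detectors(ROW))  # []
```

An empty list means the coverage count for this row is backed by a reasoning object; a non-empty list would point at a detector that contributes to the count without saying why.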

Second: The packets themselves needed re-architecture

The PDF tiers had also drifted into an awkward shape.

The old packet boundaries were no longer matching the actual content density:

Stage 4 could disappear or feel collapsed
the closeout pages could become overcrowded
and "5-page detailed packet" could stop meaning what users expected

That led to a cleaner packet model:

level 1 = brief 1p
level 2 = standard 5p
level 3 = full 7p

And just as importantly:

the default CLI path moved to level 3

That default is a statement about what the project now considers the normal artifact.

The normal artifact is no longer the brief scan.
It is the full evidence packet.
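The tier model is small enough to write down as data. This is a sketch under the assumption that a level is just an integer key; `PacketTier`, `TIERS`, and `resolve_tier` are hypothetical names and do not describe the CLI's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PacketTier:
    level: int
    name: str
    pages: int


# Tiers as described in this post.
TIERS = {
    1: PacketTier(1, "brief", 1),
    2: PacketTier(2, "standard", 5),
    3: PacketTier(3, "full", 7),
}
DEFAULT_LEVEL = 3  # the default path lands on the full packet


def resolve_tier(level=None):
    """Fall back to the full packet when no level is requested."""
    chosen = DEFAULT_LEVEL if level is None else level
    if chosen not in TIERS:
        raise ValueError(f"unknown packet level: {chosen}")
    return TIERS[chosen]


print(resolve_tier())  # PacketTier(level=3, name='full', pages=7)
```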

Why that matters

This is where the project moved from "scanner" toward "audit architecture."

A scanner can stop at a result.

An audit architecture has to preserve meaning across surfaces.

That means:

JSON must be canonical
HTML must be navigable
PDFs must honor real packet boundaries
and the same warning semantics must survive in all of them
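The last item in that list can be checked mechanically. Here is a minimal sketch that treats the JSON findings as canonical and asks whether each detector ID still appears on a rendered surface. The findings schema and the `missing_ids` helper are assumptions for illustration; only the two detector IDs come from this post.

```python
def missing_ids(findings, rendered):
    """Detector IDs in the canonical findings that a rendered surface dropped."""
    return sorted({f["id"] for f in findings if f["id"] not in rendered})


# Hypothetical canonical findings and a Markdown rendering that lost one of them.
findings = [
    {"id": "C5_compliance_boundary_integrity", "status": "WARN"},
    {"id": "C4_exception_handling_clinical_adjacent_paths", "status": "WARN"},
]
markdown = "## Warnings\n- C5_compliance_boundary_integrity: WARN\n"

print(missing_ids(findings, markdown))
```

A non-empty result means a warning exists in the canonical output but did not survive translation, which is exactly the failure this section is about.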

That is why these changes matter to developers.

They are part of the correctness story.

If the why disappears when the result becomes a report, the audit object was never complete to begin with.

The hidden pattern behind all three changes

These releases can look like a mixed bag:

JS manifest support
legal/compliance claim surfacing
external dependency risk
C4/C5/C6 split
AIRI reasoning
packet restructuring
demo/output alignment

But there is a single pattern underneath them:

the system became less willing to let ambiguity hide inside a convenient surface.

That showed up in three ways:

a manifest should count if it exists
a warning lane should mean one thing
a risk mapping should explain itself

That may sound almost obvious.

But a lot of tools never make it that far.

They accumulate clever features faster than they reduce ambiguity.

This line of work did the opposite.

It made *the system stricter about what its outputs are allowed to imply.*

That is a more durable path.

The more interesting lesson

The most useful thing about 1.7.6 through 1.7.8 is not that STEM BIO-AI became "smarter."

It is that it became harder to misread.

That is a better goal for audit tooling.

Especially now.

Because in a world increasingly full of fluent agent outputs, the differentiator is not whether a tool can generate a plausible narrative.

It is whether the narrative stays tethered to inspectable structure when the repository is messy, cross-stack, overclaimed, or partially misleading.

That is where this release line got better.

Not by pretending to know more than it does.

But by making its own boundaries clearer.

What I would tell developers evaluating this line

If you only look at the release notes, you might think:

better AIRI
more warnings
nicer reports

That is true, but too shallow.

The real changes are:

the scanner is less Python-centric than it was
the warning taxonomy is more semantically stable than it was
the artifacts are more inspectable than they were

That combination matters more than any one score change.

It means the tool is becoming less of a clever repo grader and more of a reliable evidence instrument.

That is the direction I care about.

Because once the repository is politically messy, clinically adjacent, or governance-sensitive, "good-enough automation" is not enough.

The system has to show its work.

These versions got noticeably better at doing that.

Try It

pip install stem-ai
stem /path/to/repo

If you want the full packet explicitly:

stem scan /path/to/repo --level 3 --format all --explain

The default path now lands on the full evidence packet, and that is the point.

In audit tooling, the serious path should not require an extra flag.

See the Artifact

If you want to inspect the actual artifact shape behind this release line, these two public outputs are the best reference:

Interactive HTML report: Open interactive HTML report
Full 7p PDF packet: Open full 7p PDF packet

The point of 1.7.8 is not just that the scanner scores the repository differently.

It is that the same result now survives translation into JSON, Markdown, HTML, and a full review packet without losing too much meaning along the way.
