Three technical changes that made the scanner less Python-shaped, the warning model more stable, and the reports more inspectable.
The last time I wrote about STEM BIO-AI, the focus was AIRI:
how a local repository scanner could expand its risk vocabulary without pretending to become a universal AI safety judge.
That was the right story for 1.7.0 and 1.7.1.
But the project changed meaningfully after that.
For readers who have not followed the earlier posts: Beyond Repo Scanning: How AIRI Expanded the Risk Vocabulary in STEM BIO-AI 1.7.x
By 1.7.8, the interesting question was no longer just:
Can this scanner attach a broader risk language to local findings?
It became:
Can this scanner make those findings more inspectable, less misleading, and more robust across real repository shapes?
That shift matters.
Because in audit tooling, correctness is only the first battle. The second battle is whether a reviewer can see why the tool landed where it did, and whether the output still makes sense when it leaves the terminal and becomes a report, a PDF packet, a Hugging Face demo, or a governance memo.
From 1.7.6 through 1.7.8, three changes mattered most.
They changed:
- what counts as evidence,
- how warning lanes are separated,
- and how the final artifact stays legible across surfaces.
This is the more technical story behind those releases.
Basic AIRI (the AI Risk Repository) Context: Expanding the Language of Risk
Before getting into the release details, it helps to define what AIRI means in this series.
AIRI refers here to the MIT AI Risk Repository: a public AI risk resource from the MIT AI Risk Initiative that organizes fragmented AI risk language across research, policy, and industry sources.
The repository includes an AI Risk Database, a Causal Taxonomy of AI Risks, and a Domain Taxonomy of AI Risks. According to the MIT AI Risk Repository site, the database collects 1,700+ risks from 74 existing AI risk frameworks and classifications, while the public domain taxonomy organizes risks into 7 domains and 24 subdomains.
That makes AIRI useful as a vocabulary source.
But vocabulary is not truth.
A local scanner should not say:
this repository caused this risk.
It should say something more careful:
this local finding belongs to a broader class of AI risk language.
That distinction is the design boundary.
1. Problem: The scanner was still too Python-shaped
One of the more useful failures in this line came from an uncomfortable result: a repository could obviously have dependency and lockfile evidence, and STEM BIO-AI could still miss it.
That is not a philosophical problem.
That is an implementation problem.
In practice, the project was still too biased toward Python-native signals.
That showed up most clearly in JavaScript or mixed-stack repositories:
- package.json
- package-lock.json
- pnpm-lock.yaml
- yarn.lock
- npm-shrinkwrap.json
were not being treated as first-class provenance and replication evidence in the same way that requirements.txt or pyproject.toml were.
The result was a false negative pattern:
- Stage 3 provenance (B1) could be undercounted
- Stage 4 replication evidence could be undercounted
- and the report could quietly imply "no dependency evidence" when the repository clearly had dependency structure
That kind of miss is more dangerous than it sounds.
Not because it makes the score a little wrong.
But because it damages trust in the scanner's worldview.
If developers see a tool miss an obvious pnpm-lock.yaml, they stop believing the harder claims too.
What changed in 1.7.6
The fix was straightforward but important:
- JavaScript manifests and lockfiles were promoted into the same evidence families as the existing Python manifests where appropriate.
Concretely, that meant:
- B1_data_provenance_controls started recognizing JS manifest/lock surfaces
- S4_environment_lock_evidence started recognizing them
- S4_exact_dependency_pins_or_hashes started recognizing them
This was not a scoring philosophy change.
It was a scope correction.
The rule engine learned that a dependency ecosystem is a dependency ecosystem even when it is not Python.
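A minimal sketch of that scope correction, assuming a simple file-name lookup. The file names come from this post; the table layout and the function names `classify_dependency_surfaces` and `has_dependency_evidence` are hypothetical and are not the actual STEM BIO-AI rule engine.

```python
from pathlib import Path

# Hypothetical grouping: file names are from the post, the structure is illustrative.
MANIFESTS = {
    "python": {"requirements.txt", "pyproject.toml"},
    "javascript": {"package.json"},
}
LOCKFILES = {
    "javascript": {
        "package-lock.json",
        "pnpm-lock.yaml",
        "yarn.lock",
        "npm-shrinkwrap.json",
    },
}


def classify_dependency_surfaces(paths):
    """Group paths by (ecosystem, kind), where kind is 'manifest' or 'lockfile'."""
    found = {}
    for raw in paths:
        name = Path(raw).name  # match on the file name only, at any depth
        for kind, table in (("manifest", MANIFESTS), ("lockfile", LOCKFILES)):
            for ecosystem, names in table.items():
                if name in names:
                    found.setdefault((ecosystem, kind), []).append(str(raw))
    return found


def has_dependency_evidence(paths):
    """True if any known manifest or lockfile is present, regardless of ecosystem."""
    return bool(classify_dependency_surfaces(paths))
```

The point of the sketch is the symmetry: a JavaScript lockfile lands in the same kind of evidence family as a Python manifest, so "no dependency evidence" can only be reported when neither ecosystem is present.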
One boundary matters here.
B1_data_provenance_controls does not suddenly mean "dataset lineage was proven by a lockfile."
In this lane, B1 is using dependency manifests as repository provenance surfaces:
- what environment the repository expects,
- what dependency custody the repository exposes,
- and whether the repo surfaces any adjacent data-source, IRB, or dataset-citation language around that environment.
That is weaker than dataset lineage evidence.
But it is also much stronger than pretending a mixed-stack repository has no provenance surface at all.
A small before/after that makes the point
The yorkeccak/bio case is a good example because the score movement was not philosophical. It was mechanical.
Before the JS manifest fix, the same repository could produce:
version: 1.7.5
final_score: 40
stage_3_code_bio: 6
B1_data_provenance_controls: 0 / 15
replication_score: 10
AIRI covered_count: 0 / 31
After the manifest and lockfile correction, the same repository shape produced:
version: 1.7.8
final_score: 48
stage_3_code_bio: 25
B1_data_provenance_controls: 15 / 15
replication_score: 30
AIRI covered_count: 7 / 32
The important part is not the score delta by itself.
One small boundary is worth making explicit here.
The AIRI change is doing two things at once:
- the denominator moved from 31 to 32 because the governed AIRI detector scope expanded by one mapping row across this release line,
- and the numerator moved from 0 to 7 because the current release can now carry more bounded AIRI links around the findings it actually surfaced.
That explains the AIRI coverage delta.
The scoring delta came from a more mechanical correction:
- package.json, package-lock.json, and pnpm-lock.yaml stopped being invisible,
- Stage 3 stopped saying "no dependency/provenance manifest detected,"
- and Stage 4 stopped undercounting replication structure that was obviously there.
That is what I mean by "blind spot removal" rather than score drift.
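To make the two deltas easier to separate, here is a small sketch that diffs the before/after summaries quoted above. The numbers are copied from this post; the key names `airi_covered` and `airi_scope` and the helper functions are assumptions for illustration, not the scanner's real output schema.

```python
# Values copied from the before/after summaries above; key names are illustrative.
BEFORE = {
    "version": "1.7.5",
    "final_score": 40,
    "stage_3_code_bio": 6,
    "B1_data_provenance_controls": 0,
    "replication_score": 10,
    "airi_covered": 0,
    "airi_scope": 31,
}
AFTER = {
    "version": "1.7.8",
    "final_score": 48,
    "stage_3_code_bio": 25,
    "B1_data_provenance_controls": 15,
    "replication_score": 30,
    "airi_covered": 7,
    "airi_scope": 32,
}


def scan_delta(before, after):
    """Return after - before for every integer field present in both summaries."""
    return {
        key: after[key] - before[key]
        for key in before
        if key in after
        and isinstance(before[key], int)
        and isinstance(after[key], int)
    }


def coverage(covered, scope):
    """AIRI coverage as a fraction; numerator and denominator can move independently."""
    return covered / scope if scope else 0.0


if __name__ == "__main__":
    for field, change in scan_delta(BEFORE, AFTER).items():
        print(f"{field}: {change:+d}")
```

Keeping `airi_covered` and `airi_scope` as separate fields is what lets a reader see that the scope change and the coverage change are different events.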
Why that matters
This is the kind of change that sounds small in a changelog but large in practice.
Because it changes the relationship between the tool and the developer reading it.
A scanner earns the right to say "this repo is weak on provenance" only after it can correctly see the basic surfaces that exist in the target stack.
That correction also made later report outputs more believable.
When B1 moved from 0 to 15 in affected repositories, that was not "score drift." It was the removal of a blind spot.
And that distinction is exactly why audit tools need explicit versioned rationale.
Without it, every score movement looks arbitrary.
2. Problem: The warning lanes were doing too many jobs at once
Before the split, it helps to read C1–C6 as code-integrity lanes.
They are not general AI risk categories. They are reviewer-facing signals that tell you what kind of repository weakness the scanner found, and where to inspect next.
| Lane | What it means in STEM BIO-AI | What a reviewer should inspect |
|---|---|---|
| C1 | Hardcoded credential signals | exposed API keys, cloud keys, tokens, or credential-like patterns |
| C2 | Dependency pinning and external-service fragility | loose dependency ranges, missing exact pins, fragile external service assumptions |
| C3 | Deprecated patient-adjacent paths | legacy, archive, or deprecated folders that still contain patient or clinical-adjacent patterns |
| C4 | Fail-open exception handling | except: pass, except Exception: pass, silent fallbacks, or code paths where errors can disappear |
| C5 | Compliance and clinical-boundary integrity | unsupported HIPAA, compliance, clinical-safe, self-hosted, or regulatory-adjacent claims |
| C6 | Mock-auth or no-auth local/self-host trust boundaries | auto-login, mock authentication, no-auth flows, or weak local trust-boundary assumptions |
That table matters because C4, C5, and C6 are not interchangeable.
A fail-open exception is not the same problem as an unsupported compliance claim.
And an unsupported compliance claim is not the same problem as a mock-auth self-host boundary.
That distinction became important once the report started surfacing more nuanced governance signals.
The old C4 lane had started life as a code-oriented fail-open/exception surface.
But as the scanner got better at spotting unsupported compliance language and boundary failures, more and more signals were being interpreted near that same lane.
That made the result harder to read.
If a reviewer sees:
C4_exception_handling_clinical_adjacent_paths: WARN
they should be able to infer the remediation class immediately.
They should know to inspect executable control flow.
They should not have to wonder whether the warning is actually about a README compliance claim, a missing clinical boundary, or a mock-auth local path.
Once one lane starts carrying all of those meanings, the ID stops doing its job.
This is a common failure mode in rule systems.
At first it feels efficient:
- one warning lane,
- one bucket,
- multiple related issues.
Then a few releases later the bucket becomes a junk drawer.
That is exactly what had to be prevented here.
What changed in 1.7.7 and 1.7.8
The solution was to split the lane cleanly:
- C4 stayed reserved for executable fail-open exception behavior
- C5 was introduced for unsupported compliance or boundary-integrity claims
- C6 was introduced for mock-auth, auto-login, or no-auth self-host/local trust-boundary signals
This was more than renaming.
It made the model of the problem cleaner:
- C4 is code-path failure semantics
- C5 is governance/claim integrity
- C6 is trust-boundary collapse in local or self-host flows
That distinction matters to developers because those are different remediation classes.
If a repository triggers C4, you inspect executable control flow.
If it triggers C5, you inspect public claim surfaces and supporting governance evidence.
If it triggers C6, you inspect local auth and trust-boundary design.
One warning label should not try to be all three.
The more interesting case is when two of those lanes fire together.
A repository can claim something like "HIPAA-ready self-hosting" at the README layer and also expose a mock-auth or auto-login local path.
That is not one problem.
It is two related problems:
- C5 says the claim surface is overstating governance integrity
- C6 says the local trust boundary is weaker than the claim suggests
That is exactly why the split matters.
If those two findings collapse into one bucket, the reviewer loses both remediation clarity and causal ordering.
If they stay separate, the report can say:
- the public claim is weak,
- the local boundary is weak,
- and both together make the repository easier to over-trust.
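The separation above can be sketched as a small lookup plus one combination rule. The remediation wording is taken from this post; the names `REMEDIATION_SURFACE` and `review_plan`, and the short lane keys, are assumptions for illustration only.

```python
# Remediation wording follows the post; the data structure itself is hypothetical.
REMEDIATION_SURFACE = {
    "C4": "executable control flow",
    "C5": "public claim surfaces and supporting governance evidence",
    "C6": "local auth and trust-boundary design",
}


def review_plan(fired_lanes):
    """Build one review step per fired lane, plus a note when C5 and C6 co-fire."""
    fired = set(fired_lanes)
    plan = [
        f"{lane}: inspect {REMEDIATION_SURFACE[lane]}"
        for lane in sorted(fired)
        if lane in REMEDIATION_SURFACE
    ]
    if {"C5", "C6"} <= fired:
        # Two separate findings stay separate; the combination is reported on top.
        plan.append(
            "C5+C6: weak public claim and weak local boundary together "
            "make the repository easier to over-trust"
        )
    return plan
```

Because each lane keeps its own line in the plan, the combined note adds context without erasing which finding needs which fix.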
The code insight
This is one of those places where good audit tooling starts looking more like good static analysis design.
A useful warning family is not just one that catches things.
It is one that stays semantically stable across releases.
That is why this split mattered:
it was not just about improving recall.
It was about preserving interpretability under growth.
Once a detector ID becomes ambiguous, your historical comparisons become weaker.
And once historical comparisons become weaker, your audit system starts losing its memory.
That is a bigger problem than one missed warning.
3. Problem: The report could still be correct and yet hard to trust
A repository scanner does not end its life in JSON.
It ends up in:
- Markdown
- HTML
- demos
- governance reviews
- screenshots
- and social arguments
That means the output architecture matters almost as much as the scoring logic.
And there were two places where this became obvious.
First: AIRI numbers needed explanation, not just display
Earlier versions could show AIRI coverage as a count, but not always make it obvious why a covered risk appeared.
That is a problem.
Because a number like 7 / 32 looks precise.
But precision without causal explanation is fragile.
Developers do not just want to know that a risk mapped.
They want to know:
- which detector triggered it,
- why that detector maps to that AIRI risk,
- and what boundary still remains around that mapping.
So the AIRI layer had to become more explicit.
That is where mapping_details mattered.
Covered AIRI rows now carry bounded reasoning objects that can say, in effect:
- detector ID
- mapping justification
- trigger reason
That is a much stronger artifact than a bare coverage count.
It turns AIRI from a visual add-on into an inspectable vocabulary layer.
In practice the object now looks more like this:
{
  "id": "24.01.03",
  "title": "Safe exploration problem with widely deployed AI assistants",
  "covered_by": ["C5_compliance_boundary_integrity"],
  "mapping_details": [
    {
      "detector_id": "C5_compliance_boundary_integrity",
      "mapping_justification": "Weak compliance and clinical-boundary integrity can cause users to over-trust unsafe exploration in clinical-adjacent contexts.",
      "trigger_reason": "Unsupported legal/compliance claim surfaced in boundary-integrity lane."
    }
  ]
}
That matters because the AIRI layer no longer asks the reviewer to trust a number alone.
It now gives the reviewer a bounded reasoning object to inspect.
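One way a reviewer could lean on that object is to check that every detector listed in `covered_by` actually carries an explanation. The sketch below assumes only the field names visible in the example above; the helper `unexplained_detectors` is hypothetical and not part of the tool.

```python
def unexplained_detectors(row: dict) -> list:
    """Return detectors in `covered_by` that lack a complete mapping_details entry."""
    explained = {
        detail.get("detector_id")
        for detail in row.get("mapping_details", [])
        # An entry only counts if both reasoning fields are non-empty.
        if detail.get("mapping_justification") and detail.get("trigger_reason")
    }
    return [det for det in row.get("covered_by", []) if det not in explained]
```

An empty list means the coverage count for that row is backed by inspectable reasoning; a non-empty list is a row where the number is still asking to be trusted on its own.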
Second: The packets themselves needed re-architecture
The PDF tiers had also drifted into an awkward shape.
The old packet boundaries were no longer matching the actual content density:
- Stage 4 could disappear or feel collapsed
- the closeout pages could become overcrowded
- and "5-page detailed packet" could stop meaning what users expected
That led to a cleaner packet model:
- level 1 = brief 1p
- level 2 = standard 5p
- level 3 = full 7p
And just as importantly:
- the default CLI path moved to level 3
That default is a statement about what the project now considers the normal artifact.
The normal artifact is no longer the brief scan.
It is the full evidence packet.
Why that matters
This is where the project moved from "scanner" toward "audit architecture."
A scanner can stop at a result.
An audit architecture has to preserve meaning across surfaces.
That means:
- JSON must be canonical
- HTML must be navigable
- PDFs must honor real packet boundaries
- and the same warning semantics must survive in all of them
That is why these changes matter to developers.
They are part of the correctness story.
If the why disappears when the result becomes a report, the audit object was never complete to begin with.
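A rough sketch of what "the same warning semantics must survive" can look like as a check: render a surface from the canonical JSON, then confirm every warning ID is still present in the rendered text. The report schema (`final_score`, `warnings`, `id`, `status`, `reason`) and both function names are assumptions made for this example, not the tool's real format.

```python
def render_markdown(report: dict) -> str:
    """Render a hypothetical canonical report dict as a small Markdown summary."""
    lines = [f"# Scan report (score {report['final_score']})", ""]
    for warning in report["warnings"]:
        lines.append(
            f"- {warning['id']}: {warning['status']} ({warning['reason']})"
        )
    return "\n".join(lines)


def missing_warning_ids(report: dict, rendered: str) -> list:
    """Return warning IDs from the canonical report that the rendered text dropped."""
    return [w["id"] for w in report["warnings"] if w["id"] not in rendered]
```

The same check can be pointed at HTML or extracted PDF text. It is deliberately crude, but it catches the failure this section is about: a surface that shows a result after the reason for it has been dropped.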
The hidden pattern behind all three changes
These releases can look like a mixed bag:
- JS manifest support
- legal/compliance claim surfacing
- external dependency risk
- C4/C5/C6 split
- AIRI reasoning
- packet restructuring
- demo/output alignment
But there is a single pattern underneath them:
the system became less willing to let ambiguity hide inside a convenient surface.
That showed up in three ways:
- a manifest should count if it exists
- a warning lane should mean one thing
- a risk mapping should explain itself
That may sound almost obvious.
But a lot of tools never make it that far.
They accumulate clever features faster than they reduce ambiguity.
This line of work did the opposite.
It made *the system stricter about what its outputs are allowed to imply.*
That is a more durable path.
The more interesting lesson
The most useful thing about 1.7.6 through 1.7.8 is not that STEM BIO-AI became "smarter."
It is that it became harder to misread.
That is a better goal for audit tooling.
Especially now.
Because in a world increasingly full of fluent agent outputs, the differentiator is not whether a tool can generate a plausible narrative.
It is whether the narrative stays tethered to inspectable structure when the repository is messy, cross-stack, overclaimed, or partially misleading.
That is where this release line got better.
Not by pretending to know more than it does.
But by making its own boundaries clearer.
What I would tell developers evaluating this line
If you only look at the release notes, you might think:
- better AIRI
- more warnings
- nicer reports
That is true, but too shallow.
The real changes are:
- the scanner is less Python-centric than it was
- the warning taxonomy is more semantically stable than it was
- the artifacts are more inspectable than they were
That combination matters more than any one score change.
It means the tool is becoming less of a clever repo grader and more of a reliable evidence instrument.
That is the direction I care about.
Because once the repository is politically messy, clinically adjacent, or governance-sensitive, "good-enough automation" is not enough.
The system has to show its work.
These versions got noticeably better at doing that.
Try It
pip install stem-ai
stem /path/to/repo
If you want the full packet explicitly:
stem scan /path/to/repo --level 3 --format all --explain
The default path now lands on the full evidence packet, and that is the point.
In audit tooling, the serious path should not require an extra flag.
See the Artifact
If you want to inspect the actual artifact shape behind this release line, these two public outputs are the best reference:
- Interactive HTML report: Open interactive HTML report
- Full 7p PDF packet: Open full 7p PDF packet
The point of 1.7.8 is not just that the scanner scores the repository differently.
It is that the same result now survives translation into JSON, Markdown, HTML, and a full review packet without losing too much meaning along the way.







