Control slowly becomes authority when nobody marks the boundary.
That is the calibration problem I kept running into while building STEM BIO-AI.
At first, STEM BIO-AI was centered on the score. It scanned a local bio or medical AI repository, inspected observable repository surfaces, and mapped the repository to a structured review tier.
That was useful.
But it was not enough.
The harder problem was not producing a number. The harder problem was preventing every useful adjacent signal from becoming part of that number.
In a bio/medical AI repository review system, several lanes can look similar if the tool is not careful:
- deterministic scoring
- diagnostic findings
- replication evidence
- advisory interpretation
- domain-specific review posture
They all matter.
But they should not all have the same authority.
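One way to make that concrete is to tag each lane with an explicit authority flag. A minimal sketch (the lane names come from the list above; the flag values express the design claim, not STEM BIO-AI's actual internals):

```python
# The five lanes from the list above, tagged with an explicit authority flag.
# The design claim: only one lane may bear the formal score.
LANES = {
    "deterministic scoring": True,            # score-bearing
    "diagnostic findings": False,             # informs reviewers only
    "replication evidence": False,            # Stage 4 lane, reported separately
    "advisory interpretation": False,         # advice, never authority
    "domain-specific review posture": False,  # calibration input, not a score
}

score_bearing = [lane for lane, bears_score in LANES.items() if bears_score]
assert score_bearing == ["deterministic scoring"]
```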
That is the core reason calibration became a governance problem in the 1.7.x line.
The principle is simple:
easy experimentation, hard drift
STEM BIO-AI should let researchers express review posture. It should let operators simulate policy changes. It should make policy metadata visible in artifacts.
But it should not let those inputs silently mutate the official score.
## A Short Context for New Readers
STEM BIO-AI is a deterministic evidence-surface scanner for bio and medical AI repositories.
It does not validate biomedical efficacy. It does not certify clinical safety. It does not prove that a model is correct.
It scans observable repository surfaces such as:
- README and docs
- code structure
- CI configuration
- dependency manifests
- changelogs
- evidence and boundary language
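To make "observable repository surfaces" concrete, here is a minimal presence-check sketch in Python. The glob patterns are illustrative guesses, not the scanner's actual file map:

```python
from pathlib import Path

# Illustrative mapping from surfaces to repository paths (assumed globs).
SURFACE_GLOBS = {
    "readme_and_docs": ["README*", "docs/**/*"],
    "ci_configuration": [".github/workflows/*", ".gitlab-ci.yml"],
    "dependency_manifests": ["requirements*.txt", "pyproject.toml"],
    "changelogs": ["CHANGELOG*"],
}

def observed_surfaces(repo: Path) -> dict[str, bool]:
    """Report which observable surfaces exist -- presence only, no judgment."""
    return {
        name: any(list(repo.glob(pattern)) for pattern in patterns)
        for name, patterns in SURFACE_GLOBS.items()
    }

print(observed_surfaces(Path(".")))
```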
The formal score is currently built from three weighted score-bearing stages, plus an explicit credential penalty and clinical cap or hard-floor logic:
| Stage | Role |
|---|---|
| Stage 1 | README / stated evidence boundary |
| Stage 2R | repo-local consistency |
| Stage 3 | code and bio-responsibility surface |
On top of those stages, the active formula still applies:

- `C1_penalty` when hardcoded credentials are detected
- `score_cap` or `t0_hard_floor` when clinical-adjacent boundary rules require it
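Put together, the score-bearing path has roughly this shape. A minimal sketch: the weights and the penalty value are placeholders, not the released constants, and the `t0_hard_floor` behavior (pinning the result to the lowest tier) is my reading of the name, not a published spec:

```python
def formal_score(
    stage_1: float,   # README / stated evidence boundary, 0-100
    stage_2r: float,  # repo-local consistency, 0-100
    stage_3: float,   # code and bio-responsibility surface, 0-100
    *,
    has_hardcoded_credentials: bool = False,
    score_cap: float | None = None,
    t0_hard_floor: bool = False,
) -> float:
    # Hypothetical weights; the released constants live in the active profile.
    score = 0.30 * stage_1 + 0.25 * stage_2r + 0.45 * stage_3

    # C1_penalty: explicit deduction when hardcoded credentials are detected.
    if has_hardcoded_credentials:
        score -= 15.0  # placeholder penalty value

    # score_cap: clinical-adjacent boundary rules can cap the result.
    if score_cap is not None:
        score = min(score, score_cap)

    # t0_hard_floor: read here as forcing the lowest review tier outright.
    if t0_hard_floor:
        score = 0.0

    # Stage 4 (replication) is deliberately absent: it is a separate lane.
    return max(score, 0.0)
```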
Stage 4 exists, but it is a separate replication lane. It reports reproducibility and replication posture without automatically changing the formal score.
That separation is intentional.
## What Is Actually Implemented in the Current 1.7.5 State of 1.7.x
Before discussing calibration philosophy, the implementation boundary has to be clear.
In the current 1.7.5 state of the 1.7.x line, STEM BIO-AI has implemented a real calibration architecture, but that architecture is still mostly mirror-only and preview-oriented.
This post describes the current released state of the 1.7.x line as of v1.7.5, not a future authoritative-read-through design.
Implemented surfaces include:
- packaged calibration profiles
- schema and runtime validation
- profile identity surfaced in result metadata
- `stem policy list`, `stem policy explain`, `stem policy derive`, and `stem policy simulate` commands
- simulation-only local profile files
- profile hashes and read-mode metadata in artifacts
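Those last two items end up in scan artifacts. A sketch of what such an identity block could look like; the field names are illustrative, and only the concepts (profile identity, profile hash, read mode) come from the list above:

```python
import hashlib
import json

# Hypothetical packaged profile payload, reduced to an inline dict.
profile = {"name": "default", "schema_version": "1.7.5"}

# A content hash lets an artifact prove exactly which profile text
# was in effect when the result was produced.
profile_hash = hashlib.sha256(
    json.dumps(profile, sort_keys=True).encode("utf-8")
).hexdigest()

artifact_metadata = {
    "policy_profile": profile["name"],    # profile identity in result metadata
    "policy_profile_hash": profile_hash,  # profile hash in artifacts
    "policy_read_mode": "mirror_only",    # read mode: preview, not authority
}
print(json.dumps(artifact_metadata, indent=2))
```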
The current named recommendation surface is intentionally narrow:
- `default`
- `strict_clinical_adjacency`
`reproducibility_first` is still a draft posture, not an active release-grade named recommendation.
The important limitation is this:
the authoritative scan scoring path is still protected from arbitrary user-provided profile mutation.
In other words, `scan --policy <name>` can surface selected profile metadata. `policy derive` and `policy simulate` can show governed preview behavior. But user-provided profile files do not simply become the official scoring authority.
More specifically, local profile files are currently accepted only by `stem policy simulate`, and the CLI rejects them unless the file remains `mirror_only`.
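A minimal sketch of that gate, with assumed function and field names (the behavior is the stated rule: local files load only under `policy simulate`, and only when they declare `mirror_only`):

```python
import json
from pathlib import Path

class ProfileRejected(Exception):
    """Raised when a local profile file is not allowed to load."""

def load_local_profile(path: Path, command: str) -> dict:
    # Only `stem policy simulate` accepts local profile files at all.
    if command != "policy simulate":
        raise ProfileRejected(f"{command!r} does not accept local profile files")

    profile = json.loads(path.read_text(encoding="utf-8"))

    # A file that asks for more than mirror_only is refused outright:
    # simulation may mirror a profile, it may not become scoring authority.
    if profile.get("read_mode") != "mirror_only":
        raise ProfileRejected("local profile files must remain mirror_only")
    return profile
```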
That is not a missing convenience.
That is the boundary being tested before it is allowed to become authority.
## The Pressure That Causes Drift
One question pushed this design forward:
*If advisory AI becomes more capable, will teams really keep the boundary between formal score and advisory interpretation?*
I do not think the answer is automatically yes.
If an advisory layer becomes helpful, there will always be pressure to let it influence the formal score "just a little."
That is usually how audit systems drift.
The score stops being a stable artifact and starts becoming a moving interpretation layer.
The danger is not that users want control.
The danger is that control slowly becomes authority without anyone noticing.
So the design question is not:
How do we let people tune the system more freely?
The design question is:
How do we let people express domain judgment without making the formal score easy to mutate?
That is where calibration enters.
## Calibration Is Not a Tuning Console
The wrong calibration UX looks like this:
```json
{
  "stage_1_percent": 30,
  "stage_2r_percent": 25,
  "stage_3_percent": 45,
  "ca_no_disclaimer_cap": 61,
  "b2_partial_credit_mode": "looser"
}
```
This is editable.
But editable is not the same as governed.
Most researchers, operators, and domain reviewers do not think in raw score constants. They usually know something closer to this:
- clinical-adjacent claims should be treated very strictly
- reproducibility matters strongly in this environment
- README polish should not outweigh code evidence
- a casual mention of "limitations" should not count as meaningful transparency
That is why the current calibration design starts with posture questions, not raw constants.
The goal is not to ask a researcher to become a scoring-engine maintainer.
The goal is to let a researcher express domain posture while keeping the formal scoring boundary visible, versioned, and difficult to mutate accidentally.
## The 1–5 Scale Is Input, Not Authority
In the current design, the user-facing intent layer uses a 1–5 scale:
- `1` = minimal emphasis
- `2` = light emphasis
- `3` = moderate emphasis
- `4` = strong emphasis
- `5` = very strong emphasis
The important line is this:
the `1–5` scale is a UX input surface, not part of the formal score engine.
That means the user can express posture in a natural way:
- clinical strictness
- code-integrity priority
- reproducibility priority
- structured limitations requirement
But those answers do not directly become score constants.
They are translated through explicit rules.
The current decision table is intentionally narrow:
| Condition | Outcome |
|---|---|
| `clinical_strictness >= 4` and `reproducibility_priority <= 3` | recommend `strict_clinical_adjacency` |
| all four values are 2 or 3 | keep `default` |
| no named-profile rule matches | generate a `preview_only` profile delta from bounded deltas only |
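That table translates almost line for line into code. A minimal sketch, assuming hypothetical names for the posture fields: `clinical_strictness` and `reproducibility_priority` are quoted from the table, while the other two are placeholders for the remaining posture questions:

```python
from dataclasses import dataclass, fields

@dataclass
class Posture:
    # 1-5 posture answers; the last two field names are placeholders.
    clinical_strictness: int
    reproducibility_priority: int
    code_integrity_priority: int
    structured_limitations: int

    def __post_init__(self):
        # The 1-5 scale is validated as input, never used as a score constant.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (1, 2, 3, 4, 5):
                raise ValueError(f"{f.name} must be 1-5, got {value!r}")

def derive_recommendation(p: Posture) -> str:
    values = (p.clinical_strictness, p.reproducibility_priority,
              p.code_integrity_priority, p.structured_limitations)
    if p.clinical_strictness >= 4 and p.reproducibility_priority <= 3:
        return "strict_clinical_adjacency"  # release-grade named profile
    if all(v in (2, 3) for v in values):
        return "default"                    # keep the packaged default
    # No named-profile rule matched: bounded, preview-only delta.
    return "preview_only"

# The competing-posture case discussed below: two strong postures,
# no release-grade profile resolves the conflict, so preview_only.
assert derive_recommendation(Posture(4, 4, 3, 3)) == "preview_only"
```

Written this way, the entire translation layer is a handful of reviewable lines.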
This table should not be mistaken for an empirically optimized model.
It is a conservative governance rule table.
The current threshold choices are design-steward decisions, not claims of statistical optimality. Their purpose is to keep the translation layer narrow, reviewable, and non-authoritative until a stronger benchmark-backed promotion process exists.
That matters because a calibration system can fail in two opposite ways:
- it can be too rigid for domain experts to use
- it can be so flexible that every local preference becomes a new score
The initial rule table chooses the safer failure mode.
If a posture is clearly within an existing release-grade profile, the system can recommend that profile. If the posture is ambiguous or combines competing priorities, the system falls back to `preview_only`.
For example, take `clinical_strictness = 4` and `reproducibility_priority = 4`.
That does not automatically recommend `strict_clinical_adjacency`.
It falls back to `preview_only`, because two strong postures are competing and no release-grade named profile currently resolves that conflict.
A hidden similarity function might produce something that looks more flexible.
But it would also make the governance harder to audit.
A narrow rule table is less magical.
It is also safer.
## What the CLI Is Allowed to Do

Pulling the boundaries above together, the permitted surface is short. `stem policy list` and `stem policy explain` expose the packaged profiles. `stem policy derive` and `stem policy simulate` show governed, preview-only behavior. `scan --policy <name>` surfaces selected profile metadata in results. Only `stem policy simulate` reads local profile files, and it rejects any file that does not remain `mirror_only`. Nothing on that list is allowed to mutate the authoritative scoring path.