DEV Community

Kwansub Yun

Posted on • Originally published at flamehaven.space

When Control Becomes Authority: Calibration Governance in STEM BIO-AI 1.7.x

Control slowly becomes authority when nobody marks the boundary.

That is the calibration problem I kept running into while building STEM BIO-AI.

At first, STEM BIO-AI was centered on the score. It scanned a local bio or medical AI repository, inspected its observable surfaces, and mapped it to a structured review tier.

That was useful.

But it was not enough.

The harder problem was not producing a number. The harder problem was preventing every useful adjacent signal from becoming part of that number.

In a bio/medical AI repository review system, several lanes can look similar if the tool is not careful:

  • deterministic scoring
  • diagnostic findings
  • replication evidence
  • advisory interpretation
  • domain-specific review posture

They all matter.

But they should not all have the same authority.

That is the core reason calibration became a governance problem in the 1.7.x line.

The principle is simple:

easy experimentation, hard drift

STEM BIO-AI should let researchers express review posture. It should let operators simulate policy changes. It should make policy metadata visible in artifacts.

But it should not let those inputs silently mutate the official score.


A Short Context for New Readers

STEM BIO-AI is a deterministic evidence-surface scanner for bio and medical AI repositories.

It does not validate biomedical efficacy. It does not certify clinical safety. It does not prove that a model is correct.

It scans observable repository surfaces such as:

  • README and docs
  • code structure
  • CI configuration
  • dependency manifests
  • changelogs
  • evidence and boundary language

The formal score is currently built from three weighted score-bearing stages, plus an explicit credential penalty and clinical cap or hard-floor logic:

| Stage | Role |
| --- | --- |
| Stage 1 | README / stated evidence boundary |
| Stage 2R | repo-local consistency |
| Stage 3 | code and bio-responsibility surface |

The active formula additionally applies:

  • C1_penalty when hardcoded credentials are detected
  • score_cap or t0_hard_floor when clinical-adjacent boundary rules require it

Stage 4 exists, but it is a separate replication lane. It reports reproducibility and replication posture without automatically changing the formal score.

That separation is intentional.
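As a rough sketch, the composition described above can be written as a single function. The weights, penalty value, and cap handling below are illustrative assumptions, not STEM BIO-AI's actual constants:

```python
# Hypothetical sketch of the formal-score composition: three weighted
# stages, a credential penalty, and an optional clinical cap.
# All constants below are illustrative assumptions.

def formal_score(stage_1, stage_2r, stage_3, *,
                 weights=(0.30, 0.25, 0.45),
                 credentials_found=False,
                 c1_penalty=15,
                 score_cap=None):
    """Combine three score-bearing stages (each 0-100) into one score."""
    w1, w2, w3 = weights
    score = w1 * stage_1 + w2 * stage_2r + w3 * stage_3
    if credentials_found:        # C1_penalty for hardcoded credentials
        score -= c1_penalty
    if score_cap is not None:    # clinical-adjacent score_cap logic
        score = min(score, score_cap)
    return max(0.0, round(score, 1))
```

Note that Stage 4 replication evidence deliberately has no parameter here: it is reported in a separate lane and never enters the composition.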


What Is Actually Implemented in the Current 1.7.5 State of 1.7.x

Before discussing calibration philosophy, the implementation boundary has to be clear.

In the current 1.7.5 state of the 1.7.x line, STEM BIO-AI has implemented a real calibration architecture, but it is still mostly mirror-only and preview-oriented.

This post describes the current released state of the 1.7.x line as of v1.7.5, not a future authoritative-read-through design.

Implemented surfaces include:

  • packaged calibration profiles
  • schema and runtime validation
  • profile identity surfaced in result metadata
  • stem policy list
  • stem policy explain
  • stem policy derive
  • stem policy simulate
  • simulation-only local profile files
  • profile hashes and read-mode metadata in artifacts

The current named recommendation surface is intentionally narrow:

  • default
  • strict_clinical_adjacency

reproducibility_first is still a draft posture, not an active release-grade named recommendation.

The important limitation is this:

the authoritative scan scoring path is still protected from arbitrary user-provided profile mutation.

In other words, scan --policy <name> can surface selected profile metadata. policy derive and policy simulate can show governed preview behavior. But user-provided profile files do not simply become the official scoring authority.

More specifically, local profile files are currently accepted only by stem policy simulate, and the CLI rejects them unless the file remains mirror_only.

That is not a missing convenience.

That is the boundary being tested before it is allowed to become authority.
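A minimal sketch of that gate, assuming a read_mode field inside the profile file (the field names, error text, and return shape are assumptions, not the actual implementation):

```python
# Hypothetical gate for user-provided profile files: only simulation
# accepts them, and only while they stay mirror_only. Field names
# are assumptions for illustration.

import json

def load_simulation_profile(path):
    with open(path) as f:
        profile = json.load(f)
    if profile.get("read_mode") != "mirror_only":
        # Reject anything that asks for scoring authority.
        raise ValueError("local profiles must remain mirror_only")
    profile["authority"] = "preview"   # never the official scoring path
    return profile
```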


The Pressure That Causes Drift

Formal score and advisory tuning drift

One question pushed this design forward:

If advisory AI becomes more capable, will teams really keep the boundary between formal score and advisory interpretation?

I do not think the answer is automatically yes.

If an advisory layer becomes helpful, there will always be pressure to let it influence the formal score "just a little."

That is usually how audit systems drift.

The score stops being a stable artifact and starts becoming a moving interpretation layer.

The danger is not that users want control.

The danger is that control slowly becomes authority without anyone noticing.

So the design question is not:

How do we let people tune the system more freely?

The design question is:

How do we let people express domain judgment without making the formal score easy to mutate?

That is where calibration enters.


Calibration Is Not a Tuning Console

The wrong calibration UX looks like this:

{
  "stage_1_percent": 30,
  "stage_2r_percent": 25,
  "stage_3_percent": 45,
  "ca_no_disclaimer_cap": 61,
  "b2_partial_credit_mode": "looser"
}

This is editable.

But editable is not the same as governed.

Most researchers, operators, and domain reviewers do not think in raw score constants. They usually know something closer to this:

  • clinical-adjacent claims should be treated very strictly
  • reproducibility matters strongly in this environment
  • README polish should not outweigh code evidence
  • a casual mention of "limitations" should not count as meaningful transparency

That is why the current calibration design starts with posture questions, not raw constants.

The goal is not to ask a researcher to become a scoring-engine maintainer.

The goal is to let a researcher express domain posture while keeping the formal scoring boundary visible, versioned, and difficult to mutate accidentally.


The 1–5 Scale Is Input, Not Authority

Posture over raw constants

In the current design, the user-facing intent layer uses a 1–5 scale:

  • 1 = minimal emphasis
  • 2 = light emphasis
  • 3 = moderate emphasis
  • 4 = strong emphasis
  • 5 = very strong emphasis

The important line is this:

the 1–5 scale is a UX input surface, not part of the formal score engine.

That means the user can express posture in a natural way:

  • clinical strictness
  • code-integrity priority
  • reproducibility priority
  • structured limitations requirement

But those answers do not directly become score constants.

They are translated through explicit rules.

Governing decision matrix

The current decision table is intentionally narrow:

| Condition | Outcome |
| --- | --- |
| clinical_strictness >= 4 and reproducibility_priority <= 3 | recommend strict_clinical_adjacency |
| all four values are 2 or 3 | keep default |
| no named-profile rule matches | generate a preview_only profile delta from bounded deltas only |

This table should not be mistaken for an empirically optimized model.

It is a conservative governance rule table.

The current threshold choices are design-steward decisions, not claims of statistical optimality. Their purpose is to keep the translation layer narrow, reviewable, and non-authoritative until a stronger benchmark-backed promotion process exists.

That matters because a calibration system can fail in two opposite ways:

  • it can be too rigid for domain experts to use
  • it can be so flexible that every local preference becomes a new score

The initial rule table chooses the safer failure mode.

If a posture is clearly within an existing release-grade profile, the system can recommend that profile. If the posture is ambiguous or combines competing priorities, the system falls back to preview_only.

For example:

clinical_strictness = 4
reproducibility_priority = 4

That does not automatically recommend strict_clinical_adjacency.

It falls back to preview_only, because two strong postures are competing and no release-grade named profile currently resolves that conflict.
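Under those rules, the decision matrix can be sketched as an explicit function; the posture keys and the return shape are illustrative assumptions:

```python
# Conservative sketch of the governing decision matrix. Ambiguous or
# competing postures always fall back to preview_only; nothing here
# invents a new authoritative profile.

def recommend_profile(posture):
    cs = posture["clinical_strictness"]
    rp = posture["reproducibility_priority"]
    if cs >= 4 and rp <= 3:
        return {"mode": "named", "profile": "strict_clinical_adjacency"}
    if all(v in (2, 3) for v in posture.values()):
        return {"mode": "named", "profile": "default"}
    return {"mode": "preview_only"}   # bounded preview delta only
```

The competing-posture example (clinical_strictness = 4 with reproducibility_priority = 4) matches neither named rule and falls through to preview_only.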

A hidden similarity function might produce something that looks more flexible.

But it would also make the governance harder to audit.

A narrow rule table is less magical.

It is also safer.


What the CLI Is Allowed to Do

Easy experimentation, hard drift: sandbox and vault

The preview workflow can look like this:

stem policy derive \
  --clinical-strictness 5 \
  --code-integrity-priority 4 \
  --reproducibility-priority 3 \
  --structured-limitations-requirement 4

or this:

stem policy simulate /path/to/repo --profile-file my_profile.json

But those flows are not the same as saying:

stem scan /path/to/repo --stage1-weight 0.35 --cap 72

The first two are governed preview surfaces.

The last one is an untracked tuning console.

The design intentionally supports the first and rejects the shape of the last.

This is the practical meaning of easy experimentation, hard drift.


What Actually Gets Verified

The central claim of this design is not:

the current calibration rules are perfect.

The claim is narrower:

calibration changes should not become score authority without a visible governance path.

That claim can be tested by checking whether the system exposes or blocks the relevant control surfaces.

| Drift risk | Expected control | How to verify it |
| --- | --- | --- |
| arbitrary score tuning | no free-form CLI weight / cap override | CLI help and accepted options do not expose direct score constants |
| hidden profile mutation | profile status and read mode are surfaced | result artifacts expose profile metadata |
| unclear profile identity | profile name, version, and hash are visible | scan output includes calibration profile identity |
| advisory influence leakage | advisory output cannot override score | advisory response validation cannot mutate final_score |
| reproducibility overcompensation | Stage 4 remains separate | replication_score does not change formal_tier |
| premature named-profile expansion | ambiguous postures fall back to preview | derive/simulate returns preview_only when no named rule matches |
| detector promotion drift | evidence-only detectors are not score-authoritative | detector policy is versioned in policy files and governance docs, even though per-detector score-integration status is not yet surfaced as first-class artifact metadata |

This is still not the same as a full empirical benchmark.

But it is a real verification target.

The system can be checked for whether it allows the forbidden mutation path.

That is the level of proof appropriate for this release line: not "the final policy is optimal," but "the policy cannot quietly become authoritative without leaving a trace."

That trace is stronger for some surfaces than others. Profile identity, hash, and read mode are already artifact-visible in 1.7.5. Detector promotion semantics are already versioned and documented, but they are not yet surfaced as first-class per-detector policy metadata in the result object.
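The artifact-visible part of that trace can be checked mechanically. The field names below are assumptions about the result object's shape, made only for illustration:

```python
# Sketch of a custody check over a result artifact: profile identity
# must be visible, and the replication lane must not carry the tier.
# Field names are illustrative assumptions.

def check_score_custody(artifact):
    problems = []
    cal = artifact.get("calibration", {})
    for field in ("profile_name", "profile_version", "profile_hash", "read_mode"):
        if field not in cal:
            problems.append("missing calibration field: " + field)
    # Stage 4 stays a separate lane: replication data must not set the tier.
    if "formal_tier" in artifact.get("replication", {}):
        problems.append("replication lane carries formal_tier")
    return problems
```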


The B2 Tightening Example

Deterministic boundary changes in B2 tightening

The clearest scoring example is Stage 3 B2.

B2 is the bias and limitations measurement surface. Earlier scoring behavior allowed a weaker boundary: a simple vocabulary-level signal could still receive partial credit.

That became too permissive.

A repository that mentions "bias" or "limitations" once is not necessarily disclosing a meaningful boundary. It may only be surface signaling.

So the B2 rule became stricter.

The important change is not a marketing claim about benchmark improvement. The important change is a deterministic boundary change:

| Case | Earlier posture | Tightened posture |
| --- | --- | --- |
| no bias / limitations vocabulary | 0 | 0 |
| minimal single-term mention only | partial credit possible | 0 |
| structured limitations language | partial credit possible | partial credit possible |
| quantitative measurement evidence | full credit possible | full credit possible |

This is the first place where calibration becomes visible as more than a principle.

The rule change creates a concrete score path difference:

a repository that previously depended only on a minimal single-term limitations mention no longer has a B2 partial-credit path after the tightening.

That is the current public claim.

I am not presenting a benchmark-wide before/after score delta here, because that would require a pinned fixture set and published comparison protocol.

Without that, a claimed "T3 became T2" example would be anecdotal at best and misleading at worst.

So the honest evidence level is rule-level impact:

  • the credit path changed
  • the changed path is deterministic
  • the changed path is inspectable
  • benchmark-level deltas should be published only when the fixture protocol is pinned
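The tightened credit path is small enough to state as a function. The category names follow the table above; the numeric credit values are placeholder assumptions, not the engine's constants:

```python
# Deterministic model of the B2 tightening: a minimal single-term
# mention loses its partial-credit path; structured and quantitative
# evidence keep theirs. Credit values are illustrative only.

def b2_credit(evidence, tightened=True):
    credit = {
        "none": 0.0,
        "minimal_single_term": 0.0 if tightened else 0.5,
        "structured_limitations": 0.5,
        "quantitative_measurement": 1.0,
    }
    return credit[evidence]
```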

In clinical-adjacent repositories, limitation language is not decoration. It is part of the claim boundary.

A one-word mention does not carry the same weight as a structured limitations section, demographic coverage statement, known failure-mode description, or quantitative subgroup analysis.

This is why calibration cannot be only a UI problem.

If a user asks for a stricter limitations posture, the system should not silently subtract points through a hidden override. It should expose the rule that changed and the reason that rule exists.

That is the difference between a score tweak and a governed scoring rationale.


Why Stage 4 Stays Separate

Importance is not score authority

Stage 4 is the place where the strongest counterargument appears.

The counterargument is fair:

If reproducibility is important, why does it not affect the formal score?

My answer is that importance and score authority are not the same thing.

Stage 4 measures replication posture: containers, reproducibility targets, dependency locks, artifact references, seeds, citation surfaces, and similar evidence.

Those signals matter.

But they do not mean the same thing as the formal claim boundary.

A repository can be highly reproducible and still make unsafe or unbounded clinical claims.

A repository can have clean containers and dependency locks while still lacking a clinical-use disclaimer.

A repository can be easy to rerun while still having weak data provenance or shallow limitation language.

If Stage 4 were allowed to lift the formal score too early, reproducibility could start compensating for claim-boundary weakness.

That would be a different scoring philosophy.

It may become valid in the future, but only if the rule is explicit.

For now, Stage 4 is reported as a separate lane because the system is saying:

  • reproducibility matters
  • reproducibility should be visible
  • reproducibility should affect review interpretation
  • reproducibility should not silently override the formal score boundary

That is why stronger reproducibility intent currently falls back to preview_only instead of becoming a release-grade named profile.

The system is not saying reproducibility is unimportant.

It is saying reproducibility has not yet been granted formal score authority.


Advisory AI Uses the Same Boundary

Advisory AI follows the same rule.

Helpful interpretation is not score authority.

STEM BIO-AI can export provider-neutral advisory packets and validate downstream advisory responses, but the deterministic scanner does not need an external model runtime to produce the formal score.

If an advisory system becomes useful, it may help interpret findings, prioritize review, or explain evidence patterns.

But unless a future release explicitly changes the policy, advisory output remains structurally subordinate to the deterministic score.
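Structurally, that subordination can be pictured as a merge that simply has no write path to the score; the result shape here is an assumption for illustration:

```python
# Sketch of advisory subordination: advisory output is attached to the
# result, but the deterministic final_score is restored afterwards, so
# there is no path by which advisory content can mutate it.

def attach_advisory(result, advisory):
    frozen = result["final_score"]
    merged = dict(result)
    merged["advisory"] = advisory
    merged["final_score"] = frozen   # deterministic score always wins
    return merged
```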

That is enough for this article.

The broader advisory boundary is a separate topic.


From Scoring Tool to Audit Workflow

From scoring tool to audit custody

The 1.7.x transition is best understood as a shift in the questions the tool is expected to answer.

| Earlier scoring-tool question | Audit-workflow question |
| --- | --- |
| What score did the repository get? | Which policy profile was visible when the score was produced? |
| Which stage contributed most? | Was that stage score-authoritative, diagnostic, or separate-lane evidence? |
| What evidence triggered the tier? | Did the evidence change the formal score or only the review posture? |
| What should the user fix? | Would a proposed policy change be preview-only, experimental, benchmark-candidate, or release-authoritative? |

This is why I describe 1.7.x as an audit-system transition.

The score still matters.

But the system is increasingly designed around the custody of the score: where it came from, what was allowed to influence it, and what was intentionally kept outside it.


What This Still Does Not Do

This boundary is just as important as the implementation.

STEM BIO-AI still does not:

  • validate biomedical efficacy
  • certify benchmark truth
  • determine clinical deployment safety
  • let advisory AI overwrite the formal score
  • open arbitrary numeric tuning in the official scan path
  • allow profile experimentation to become official policy without governance

Those are not missing conveniences.

They are boundaries.

A strong repository evidence tier is still an observable repository-surface signal. It is not clinical clearance, regulatory approval, or proof of biomedical validity.


The Next Version Direction

The next step: policy parity

The next important step is not adding more knobs.

It is authoritative policy read-through in parity mode.

That means:

  • the default policy profile becomes the source read by the scoring path
  • existing fixtures should show no score or tier drift
  • policy hashes remain visible in artifacts
  • non-default and researcher-provided profiles remain governed preview surfaces until promoted
  • score-affecting policy changes become explicit release events

This is not a big-bang rewrite.

It is authority relocation.

The goal is to move score-affecting constants into versioned policy objects without changing the score by accident.
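A parity step like that reduces to a simple gate over pinned fixtures. Everything named in this sketch is hypothetical:

```python
# Sketch of a parity gate for authority relocation: the policy-backed
# scorer must reproduce the legacy scorer exactly on every pinned
# fixture before the policy file becomes the authoritative source.

def parity_holds(fixtures, legacy_score, policy_score):
    return all(legacy_score(f) == policy_score(f) for f in fixtures)
```

If the gate fails on any fixture, the policy read-through stays non-authoritative and the mismatch becomes a review event rather than a silent score change.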

Only after that parity step does it become safe to discuss broader named profiles.


Final Position

The calibration problem is not really about giving users more control.

It is about deciding when control becomes authority.

If every useful signal can gradually influence the score, the score stops being an audit artifact.

It becomes a negotiation.

That is what STEM BIO-AI is trying to avoid.

Researchers should be able to express posture.

Operators should be able to simulate alternatives.

Policy stewards should be able to promote changes.

But the formal score should not move unless the governance path says it moved.

That is the difference between a tuning console and an audit system.
