TL;DR — One question: can we preserve EXP-032's classification behavior (passing evidence stays passing, failing evidence stays blocked) while safely extending the pipeline with AlphaGenome on/off testing and a fully auditable governance layer?
Measured result: Yes — on a locked control set (n=2 samples, n=6 arm-level observations), under automated stage-gate discipline.
- Classification accuracy: 1.0 on control set across all three test arms (control-set result, not a population generalization)
- Adding AlphaGenome: no accuracy degradation, no dangerous false-passes on this measured set
- Governance audit layer: every decision now traceable to a specific rule and component score — not just a verdict
- Stage-gate progression:
PASS_CORE → HOLD → PASS (a real bug was caught and fixed mid-run)
1. Why this happens (The "Pipeline Blind Spot")
A while back I posted a question: "How do you know when AlphaFold is hallucinating?"
That background motivated this experiment, but it is not part of the measured evidence below.
But there's a harder version of the same question:
How do you know when your entire pipeline is hallucinating — not just one model, but the chain of decisions built on top of it?
A model returns a confident-looking structure. Every downstream system processes it cleanly. The report looks polished. And the scientific object you thought you validated was never actually tested.
That's the problem this experiment is working on.
2. Why we're on experiment 33 — and what makes this different from other bio AI
Tools like AlphaFold2, AlphaFold3, Chai-1, and Boltz-1 are generation engines — they take inputs and produce outputs. RExSyn doesn't generate answers. It governs them.
The question isn't "what does the model think?" It's "should we trust what the model thinks — and can we prove it?"
That's why governance tools need re-validation every time something changes — a model version, a data source, a pipeline component. Generation tools converge. Governance tools don't. Each experiment is a checkpoint.
The deeper problem most bio AI pipelines ignore: components are validated in isolation. Nobody validates the chain.
A 90% accurate capture layer × 85% reliable transfer layer × 97% accurate model = ~74% end-to-end reliability.
Individual benchmarks say nothing about that compounding. RExSyn measures it explicitly — every layer, every run, with a traceable verdict.
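The compounding arithmetic above can be checked in a couple of lines. This is a minimal sketch, not RExSyn's actual measurement code; the function name is illustrative.

```python
from math import prod

def end_to_end_reliability(scores):
    """End-to-end reliability is the product of per-layer scores,
    so it is always at or below the weakest individual layer."""
    return prod(scores)

# The worked example from the text: 0.90 capture x 0.85 transfer x 0.97 model
print(round(end_to_end_reliability([0.90, 0.85, 0.97]), 3))  # -> 0.742
```

Note that no single layer here scores below 0.85, yet the chain lands near 0.74; that gap is exactly what per-component benchmarks miss.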
The key components in this experiment:
- LawBinder — policy compliance engine. Fail-closed: when in doubt, it escalates rather than clears.
- CareChain Governance Engine (CCGE) — B2B collaboration project. The audit layer that records not just what the verdict was, but which rule triggered and which component score fell short. Built around one equation:
  p_e2e = p_capture × p_transfer × p_model × p_clinical_interpretation
  Fail-closed: any single component below its floor → BLOCK, regardless of the overall score.
- AlphaGenome — Google DeepMind's genomic prediction API. Tested on/off under identical conditions to measure governance impact.
- Discord score — disagreement index between internal reviewers. Tracked separately from the verdict — a signal, not a decision.
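The fail-closed floor rule can be sketched in a few lines. The floor values and field names below are illustrative assumptions, not the experiment's published thresholds.

```python
# Illustrative per-component floors -- the real thresholds are not published here.
FLOORS = {"capture": 0.70, "transfer": 0.70, "model": 0.85, "clinical": 0.85}

def ccge_verdict(scores):
    """Fail-closed: BLOCK if any single component is under its floor,
    regardless of how high the overall product might be."""
    for name, floor in FLOORS.items():
        if scores[name] < floor:
            return "BLOCK", name   # the triggering rule travels with the verdict
    return "PASS", None

# One weak layer blocks even though the other three are strong.
print(ccge_verdict({"capture": 0.95, "transfer": 0.62,
                    "model": 0.95, "clinical": 0.95}))  # -> ('BLOCK', 'transfer')
```

Returning the failing component alongside the verdict is what makes the decision traceable to a rule rather than a bare PASS/BLOCK.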
3. What PASS/BLOCK actually means — and why EXP-032 wasn't enough
EXP-032 showed that the pipeline could separate passing evidence from failing evidence with perfect accuracy on the control set.
That's a meaningful result. But "it worked once" is not the same as "it's reproducible." And "it's reproducible" is not the same as "I can prove it with signed artifacts that anyone can independently verify."
EXP-032 also had a deeper gap: even when a case was correctly classified, the reason wasn't fully machine-readable. You could read the verdict. You couldn't automatically trace which governance rule drove it, or reconstruct the decision from component scores alone.
EXP-033 was built to close both gaps.
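What "machine-readable reason" looks like in practice can be sketched as a record that carries the verdict together with the rule and scores behind it. Field names here are hypothetical; the real schema lives in the artifact bundle.

```python
import json

# Hypothetical audit record: the verdict plus the rule and component scores
# that produced it, so the decision can be reconstructed without a rerun.
record = {
    "verdict": "BLOCK",
    "triggered_rule": "component_floor",
    "failing_component": "transfer",
    "component_scores": {"capture": 0.91, "transfer": 0.62,
                         "model": 0.95, "clinical": 0.90},
}

required = {"verdict", "triggered_rule", "failing_component", "component_scores"}
assert required <= record.keys()   # completeness check in the spirit of Gate 4
print(json.dumps(record, sort_keys=True))
```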
4. The five-gate checkpoint system
Previous experiments had a silent failure mode: run something, get a result, discover three experiments later that the result was based on drifted inputs or a missing artifact. By then the trail is cold.
EXP-033 introduced five sequential checkpoints. Fail one → stop at that gate → patch → rerun from there. No skipping.
Gate 1 — Classification Parity: Did passing cases get passed? Did failing cases get blocked? Must be perfect across all test arms for this control set (n=2; arm-level n=6).
Gate 2 — Reproducibility Integrity: Are all required output files present? Are artifact hashes unique across cases? Did all pipeline processes exit cleanly?
Gate 3 — Cross-Experiment Comparison: Does EXP-033 match EXP-032 on the same inputs? Classification must be identical. The only allowed changes are in the disagreement signal.
Gate 4 — Audit Layer Check: Did the governance engine output not just a verdict, but also the rule that triggered and the component scores behind it? All four required fields must be present.
Gate 5 — AlphaGenome Extension (optional): Does turning AlphaGenome on or off degrade accuracy or safety routing? Must not.
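The stop-at-the-failing-gate discipline described above can be sketched as a sequential runner. The gate names come from the text; the check functions are placeholders standing in for the real logic.

```python
def run_gates(gates):
    """Run checkpoints in order. The first failure halts the run at that
    gate -- no skipping, no continuing past a failed check."""
    for name, check in gates:
        if not check():
            return "HOLD", name   # stop here; patch, then rerun from this gate
    return "PASS", None

# Placeholder checks; the last one simulates the first-run Gate 5 failure.
gates = [
    ("classification_parity",  lambda: True),
    ("reproducibility",        lambda: True),
    ("cross_experiment",       lambda: True),
    ("audit_layer",            lambda: True),
    ("alphagenome_extension",  lambda: False),
]
print(run_gates(gates))  # -> ('HOLD', 'alphagenome_extension')
```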
5. The HOLD — and why it was the right outcome
First run with AlphaGenome enabled: status = HOLD.
Gate 5 failed. Not because AlphaGenome broke anything — because Gate 5 couldn't evaluate whether it degraded anything. The AlphaGenome runner path wasn't injecting the expected verdict labels into its output. So the gate correctly said: "I cannot verify non-degradation. Blocking."
Fix: make label injection unconditional in the runner, regardless of which mode is active.
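In sketch form, the fix amounts to moving the label injection out of any mode-dependent branch so it always runs. Function and field names here are illustrative, not the actual runner code.

```python
def run_case(case, alphagenome_enabled):
    """Runner sketch: the expected-verdict label is injected unconditionally,
    so downstream gates can always evaluate the output."""
    output = {"case_id": case["id"]}
    if alphagenome_enabled:
        output["alphagenome"] = {"enabled": True}   # mode-specific extras only
    # Before the fix, a line like this sat inside a mode-specific branch and
    # was skipped on the AlphaGenome path; now it runs in every mode.
    output["expected_verdict"] = case["expected_verdict"]
    return output

out = run_case({"id": "C1", "expected_verdict": "PASS"}, alphagenome_enabled=True)
print(out["expected_verdict"])  # -> PASS
```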
Rerun: Gate 5 passed. status = PASS.
| Checkpoint report | Status | Full SHA-256 |
|---|---|---|
| Initial (no AlphaGenome) | PASS_CORE | D11C53DE0D63AD3404F34B6CD503F5C769B93F04C9F021D2378CBD87EB69D760 |
| With AlphaGenome, first run | HOLD | 6420A316EA998620690CE32BFD1579577F8B35FB23A1174C18C394F818483E63 |
| With AlphaGenome, after fix | PASS | E20A777F8E37FF591788B91AF2296B4B840C171A68B8416E179CFA2C268D0022 |
Without the checkpoint system, that bug would have traveled silently into every future AlphaGenome comparison. We'd have been measuring a broken comparison and calling it clean.
6. What the measurements showed
Run root: EXP-033-LAWBINDER-CRITIC-ALIGNMENT/artifacts/exp033_official_methodlock_parity_20260309_184159
Dataset: 16-file input set, locked from EXP-032 (n=2 samples, n=6 arm-level observations)
Manifest SHA-256: 27AE21B698980582E98B2073B8E0295BADC46FFC7764E4CE55B09A8882943011
Classification (verdict SHA-256: 017E9927FB559F1348827DF52C4339CB3BFF7A711DF876E855C0A5784ABD1281):
- Accuracy 1.0 on control set — sample-level and across all three arms
- Dangerous false-pass rate: 0.0
- Full artifact path:
  D:\Sanctum\Flamehaven-Labs\Rexsyn Experiment\EXP-033-LAWBINDER-CRITIC-ALIGNMENT\artifacts\exp033_official_methodlock_parity_20260309_184159\exp033_verdict_benchmark.json
- Full SHA-256:
  017E9927FB559F1348827DF52C4339CB3BFF7A711DF876E855C0A5784ABD1281
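Any of the published digests can be independently checked against the artifact bundle with a few lines of stdlib Python. The path in the commented usage is illustrative.

```python
import hashlib
from pathlib import Path

def sha256_of(path):
    """Stream the file and return its uppercase SHA-256 hex digest."""
    h = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest().upper()

# Illustrative usage against a published digest:
# expected = "017E9927FB559F1348827DF52C4339CB3BFF7A711DF876E855C0A5784ABD1281"
# assert sha256_of("exp033_verdict_benchmark.json") == expected
```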
EXP-032 vs EXP-033 (compare SHA-256: 4FA7A2709E4197756670805298406E828C4EDF668950A07F3CD46833ABB9C425):
- Classification: identical
- Only change: internal disagreement signal shifted slightly (delta −0.037) — verdict unaffected
Audit layer (Gate 4 — all four required fields present, rule-traceable):
| Component | Score |
|---|---|
| Data capture quality | 0.815 |
| Transfer integrity | 0.815 |
| Model accuracy (contextual) | 0.900 |
| Clinical interpretation reliability | 0.922 |
| End-to-end reliability | 0.563 |
The end-to-end score (0.563) being lower than any individual component is expected — it's a product of four multiplied terms. A pipeline where each component scores 0.90 produces p_e2e ≈ 0.66. Compounding is the point.
AlphaGenome on/off (SHA-256: 78603ABB6B47C611956DF736F5A1F99ADBBA272008EB251A15EA1B9BFAF2DB98):
- Classification: unchanged either way
- Safety routing: unchanged, zero dangerous false-passes
- What shifted: two internal scoring signals moved slightly
7. What this means in plain terms
Control-set result (n=2 samples, n=6 observations) — not a population generalization.
✅ The pipeline correctly identified every case on the control set. Evidence that should pass, passed. Evidence that should be flagged, was flagged. Across three parallel test arms.
✅ We can now prove why each decision was made — provided the artifact bundle and hashes are available. Not just "it passed," but which specific quality check it cleared and which threshold it crossed.
✅ The system caught its own bug. When we plugged in AlphaGenome, the checkpoint system stopped the run — because it couldn't verify the comparison was valid. We found the bug, fixed it, reran. Pass.
✅ Adding AlphaGenome didn't break anything on this set. Same classification outcome on/off. No new dangerous false-passes. Effect limited to minor shifts in two internal quality scores.
⏳ What we don't know yet: whether AlphaGenome (or AlphaFold2, AlphaFold3, Chai-1, Boltz-1) actually improves decisions when models disagree. That's what EXP-034 is built to test.
8. What comes next
What happens when AlphaFold2, AlphaFold3, Chai-1, and Boltz-1 all run through the same pipeline simultaneously?
These models will disagree on some molecular cases. That disagreement — not the individual predictions — is where the interesting governance signal lives.
When strong models give different answers, the question isn't "which was right." It's "which verification layer is strong enough to tell the difference between a genuine scientific disagreement and a pipeline that silently failed?"
These results are control-set reproducibility outcomes, not population-level generalization estimates.
Artifact reference: exp033_fundamental_resource_20260310.json
Claim boundary: methodology / governance / reproducibility only. No efficacy or causal claim.