Before we talk about “BioAI agents,” we need to admit the real failure mode:
LLMs are great at producing hypotheses that sound scientific — and quietly violate physics.
In this post, I’ll show:
- why “plausible trash” happens in biomedical reasoning,
- how RExSyn Nexus v0.6.1 adds Structure as a first-class reasoning dimension,
- and how to run it (API + Python), with deterministic CI testing.
The problem nobody likes to admit: “Plausible Trash”
In autonomous biomedical research, LLMs can write hypotheses that sound like science.
“Attach a 50kDa PEG chain to the binding pocket of Protein X to improve solubility.”
Semantically? Great.
Structurally? Often impossible.
And this is the failure mode I care about:
When a system is “logically convincing” but physically wrong.
So my question became:
Can we make the pipeline refuse a hypothesis — and explain exactly why — using physics-derived signals?
That’s what RExSyn Nexus v0.6.1 is about: adding the 7th dimension to our reasoning model.
We moved from M-A-D-I-F-P to M-A-D-I-F-P-S, where S = Structure, a dimension computed from AlphaFold3 confidence outputs.
What M-A-D-I-F-P-S actually means
M-A-D-I-F-P-S is not a “magic acronym.”
It’s a reasoning checklist — seven lenses the engine uses to decide whether a hypothesis deserves to survive.
- M — Methodic: Did we follow a disciplined procedure (inputs, constraints, reproducible steps), or are we hand-waving?
- A — Abductive: What is the best explanation that fits the evidence we have right now? (plausible hypothesis generation)
- D — Deductive: If the hypothesis is true, what must be true next? (logical consequences; consistency checks)
- I — Inductive: Does it generalize from prior cases / datasets / known patterns, or is it a one-off story?
- F — Falsification: What would disprove this quickly? (designing refutation tests; “how can this fail?”)
- P — Paradigm: Is it compatible with established domain constraints and assumptions (biology, chemistry, protocols), or does it violate the frame?
- S — Structure (new in v0.6.1): Even if it’s semantically convincing, is it physically viable? In v0.6.1, S is computed from AlphaFold3 confidence signals (e.g., clash/disorder/PAE-style uncertainty) and can veto a hypothesis.
In short: 6D (M-A-D-I-F-P) prevents “logical nonsense.”
7D (+S) prevents “physically impossible but linguistically plausible” ideas.
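To make the checklist concrete, here's a small illustrative sketch (my own simplification, not the engine's internals) of the seven lenses as typed scores, where S alone can sink an otherwise convincing hypothesis:

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SevenDScores:
    """Illustrative container: one score in [0, 1] per M-A-D-I-F-P-S lens."""
    methodic: float
    abductive: float
    deductive: float
    inductive: float
    falsification: float
    paradigm: float
    structure: float  # S: computed from AF3 confidence signals, not from text

    def failing_lenses(self, floor: float = 0.5) -> list:
        """Checklist view: which lenses does this hypothesis fail?"""
        return [f.name for f in fields(self) if getattr(self, f.name) < floor]


# A hypothesis can read beautifully (strong 6D) and still fail on S alone:
scores = SevenDScores(0.90, 0.85, 0.90, 0.80, 0.75, 0.80, structure=0.18)
print(scores.failing_lenses())  # ['structure'] -> physically non-viable
```

The floor value and the helper are placeholders; the real engine weighs the lenses rather than thresholding them independently.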
What RExSyn Nexus is (in one minute)
RExSyn Nexus is a pipeline that combines:
- a LOGOS reasoning core (IRF-Calc + AATS),
- semantic anchoring (embeddings),
- structural confidence (AlphaFold3 confidence schema),
- and scientific validators (PoseBusters / DockQ / SAXS), exposed as an API + job workflow.
LOGOS itself is defined as IRF-Calc’s 6D framework + AATS v2.1, plus bridge components like drift control and calibration gates.
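As a rough mental picture only (the stage names and wiring below are my simplification, not the real RExSyn API), the flow reads as a chain where every stage contributes signals to one decision record:

```python
from typing import Any, Callable, Dict

Hypothesis = Dict[str, Any]


def run_nexus_sketch(
    hypothesis: Hypothesis,
    logos_reason: Callable[[Hypothesis], Dict[str, float]],  # LOGOS core (IRF-Calc + AATS)
    anchor: Callable[[Hypothesis], Dict[str, float]],        # semantic anchoring (embeddings)
    structure_s: Callable[[Hypothesis], float],              # AF3 confidence -> S
    validators: Callable[[Hypothesis], Dict[str, float]],    # PoseBusters / DockQ / SAXS
) -> Dict[str, Any]:
    """Conceptual wiring only: every stage adds signals that feed one decision."""
    signals: Dict[str, float] = {}
    signals.update(logos_reason(hypothesis))
    signals.update(anchor(hypothesis))
    signals["structure_s"] = structure_s(hypothesis)
    signals.update(validators(hypothesis))
    return {"hypothesis": hypothesis, "signals": signals}
```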
Why v0.6.1 matters (the “why”, not the “what”)
Most stacks do this:
- Generate hypothesis
- Add citations
- Ship “confidence” as a vibe
We wanted this instead:
- Generate hypothesis
- Ask physics if it’s plausible
- If physics says “no,” reject — and log the reason
That’s why v0.6.1 is not a “feature party.”
It’s a structural hardening release: determinism, traceability, schema rigor, and compliance baked into runtime.
This release also formalizes the Sovereign Adapter Pattern with three constraints:
- Drift-free execution (same input → same output in deterministic modes)
- License sovereignty (explicit acknowledgment gates for restricted terms)
- Schema rigor (no fuzzy JSON)
The mental model: 6D vs 7D
Think of it like this:
- 6D reasoning answers: “Is this coherent?”
- 7D reasoning adds: “Would atoms allow it?”
So the engine can finally say:
“Your argument is deductively strong…
but your structure score is low (clash/disorder).
Therefore: reject.”
How it works (end-to-end)
Step 1) Mirror AF3 confidence outputs into a strict schema
Why? Because ad-hoc dicts are where “silent drift” hides.
We mirror AlphaFold3 confidence outputs with strict types (including NumPy arrays for heavy matrices).
```python
from dataclasses import dataclass
from typing import Optional

import numpy as np
import numpy.typing as npt


@dataclass(frozen=True)
class AF3PredictionResult:
    """
    Schema-aligned mirror of AlphaFold3 confidence outputs.
    Strict typing prevents downstream drift.
    """
    ptm_score: float
    iptm_score: Optional[float]
    ranking_score: float
    fraction_disordered: float
    has_clash: bool
    chain_pair_pae_mean: npt.NDArray[np.float64]  # [samples, chains, chains]
    pae_ichain: npt.NDArray[np.float64]
    pae_xchain: npt.NDArray[np.float64]
```
The key detail: we enforce npt.NDArray[np.float64] because loose typing can create “slop”—ambiguous pipelines that look correct but hide drift and cost.
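A quick sanity check of the schema (values are invented; the point is that matrices must arrive as float64 NumPy arrays and the record is immutable):

```python
import numpy as np

# Invented numbers, only to exercise the schema.
af3 = AF3PredictionResult(
    ptm_score=0.81,
    iptm_score=0.74,
    ranking_score=-12.5,
    fraction_disordered=0.12,
    has_clash=False,
    chain_pair_pae_mean=np.zeros((1, 2, 2), dtype=np.float64),  # [samples, chains, chains]
    pae_ichain=np.zeros((2, 2), dtype=np.float64),
    pae_xchain=np.zeros((2, 2), dtype=np.float64),
)

# frozen=True means an accidental in-place "fix" fails loudly instead of drifting silently:
# af3.has_clash = True  # raises dataclasses.FrozenInstanceError
```

Note that dataclasses don't validate types at runtime; the NDArray[np.float64] contract is enforced by static type checkers like mypy, with an explicit validator added if you need runtime guarantees.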
Step 2) Convert confidence metrics into a belief score (0..1)
Raw AF3 outputs are not “reasoning scores.”
They’re confidence/error signals.
So we normalize them into Structure (S): a belief score the reasoning engine can weigh against semantic arguments.
Here’s a minimal, explainable scorer (weights are heuristics; we’ll tune them later via validation feedback):
```python
import numpy as np


class StructureScorer:
    """
    Convert AF3 confidence metrics into a normalized [0,1] structure belief score.
    Heuristic weights for now; later calibrated with validators (PoseBusters/DockQ/SAXS).
    """

    def __init__(self, max_fraction_disordered: float = 0.30):
        self.max_fraction_disordered = max_fraction_disordered

    @staticmethod
    def _normalize_ranking(ranking_score: float) -> float:
        """
        Conservative normalization example:
        map an approximate (-100..1) range to [0,1].
        """
        x = (ranking_score + 100.0) / 101.0
        return float(np.clip(x, 0.0, 1.0))

    def score(self, af3: "AF3PredictionResult") -> float:
        # Base confidence from AF3 ranking_score
        ranking_component = 0.55 * self._normalize_ranking(af3.ranking_score)

        # Disorder penalty (too much disorder weakens structural viability)
        disorder_penalty = 0.0
        if af3.fraction_disordered > self.max_fraction_disordered:
            excess = af3.fraction_disordered - self.max_fraction_disordered
            disorder_penalty = 0.25 * float(np.clip(excess / 0.70, 0.0, 1.0))

        # Clash penalty ("reality veto")
        clash_penalty = 0.40 if af3.has_clash else 0.0

        final_score = float(np.clip(ranking_component + 0.35 - disorder_penalty - clash_penalty, 0.0, 1.0))
        return final_score
```
The “aha” is not the formula.
The “aha” is: Structure becomes a first-class reasoning dimension, not a post-hoc chart.
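To see the veto behave, reuse the hypothetical af3 record from the schema example above and flip only the clash flag (printed values are approximate):

```python
from dataclasses import replace

scorer = StructureScorer()

clean = replace(af3, ranking_score=-20.0, fraction_disordered=0.10, has_clash=False)
clashing = replace(clean, has_clash=True)

print(round(scorer.score(clean), 2))     # ~0.79: confident, ordered, no clash
print(round(scorer.score(clashing), 2))  # ~0.39: same inputs, but the clash penalty bites
```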
Step 3) The Guard: License Compliance as Code
When you integrate restricted research assets, “compliance” can’t be a PDF someone forgets.
So we enforce an explicit acknowledgment gate at runtime:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AF3LicenseConfig:
    ack_cc_by_nc_sa: bool = False
    ack_non_commercial: bool = False
    ack_prohibited_uses: bool = False

    def is_valid(self) -> bool:
        return self.ack_cc_by_nc_sa and self.ack_non_commercial and self.ack_prohibited_uses


class LicenseGuard:
    def __init__(self, config: AF3LicenseConfig):
        if not config.is_valid():
            raise ValueError(
                "HALTING: License acknowledgement incomplete.\n"
                "Set ack_cc_by_nc_sa=True, ack_non_commercial=True, ack_prohibited_uses=True."
            )
```
This is “compliance by design.”
It prevents silent legal drift in real teams.
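Usage is intentionally blunt. An incomplete acknowledgment halts construction before any restricted asset is touched:

```python
# Incomplete acknowledgment: construction fails and nothing downstream runs.
try:
    LicenseGuard(AF3LicenseConfig(ack_cc_by_nc_sa=True))
except ValueError as err:
    print(err)

# Explicit, complete acknowledgment: the gate opens.
guard = LicenseGuard(AF3LicenseConfig(
    ack_cc_by_nc_sa=True,
    ack_non_commercial=True,
    ack_prohibited_uses=True,
))
```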
Step 4) Test it without GPUs (deterministic mock)
You can’t run structural inference on every CI run.
But random mocks create flaky tests.
So we use a deterministic mock adapter: same input sequence → same output artifact, every time.
```python
import hashlib
import random

import numpy as np


class DeterministicMockAF3Adapter:
    """
    Hash-seeded mock: same sequence -> same AF3PredictionResult.
    Designed for CI/CD reproducibility.
    """

    def __init__(self, salt: str = "rexsyn-af3-mock-v0.6.1"):
        self.salt = salt

    def predict(self, sequence: str) -> AF3PredictionResult:
        h = hashlib.sha256((self.salt + sequence).encode()).hexdigest()
        seed = int(h[:16], 16)
        rng = random.Random(seed)

        # Minimal stable shapes for schema fidelity
        chain_pair_pae_mean = np.zeros((1, 1, 1), dtype=np.float64)
        pae_ichain = np.zeros((1, 1), dtype=np.float64)
        pae_xchain = np.zeros((1, 1), dtype=np.float64)

        return AF3PredictionResult(
            ptm_score=rng.uniform(0.2, 0.95),
            iptm_score=rng.uniform(0.2, 0.95),
            ranking_score=rng.uniform(-80.0, 0.9),
            fraction_disordered=rng.uniform(0.0, 0.6),
            has_clash=(rng.random() < 0.08),
            chain_pair_pae_mean=chain_pair_pae_mean,
            pae_ichain=pae_ichain,
            pae_xchain=pae_xchain,
        )
```
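The CI test then becomes boring in the best way. Run the mock twice, even from separate instances, and demand identical results:

```python
import numpy as np


def test_mock_adapter_is_deterministic():
    seq = "MKFLK"  # any input string; determinism is what we're testing
    r1 = DeterministicMockAF3Adapter().predict(seq)
    r2 = DeterministicMockAF3Adapter().predict(seq)  # fresh instance, same salt

    # Scalars must match exactly; the hash-seeded RNG makes equality safe.
    assert r1.ptm_score == r2.ptm_score
    assert r1.ranking_score == r2.ranking_score
    assert r1.has_clash == r2.has_clash

    # Arrays too: shapes and contents are reproducible for schema checks.
    assert np.array_equal(r1.chain_pair_pae_mean, r2.chain_pair_pae_mean)
```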
“Okay—but how do I use it?”
Here are two practical entry points.
Option A) Use the API (job workflow)
- POST /api/v1/predict
- GET /api/v1/jobs/{job_id}
- GET /api/v1/jobs/{job_id}/result
Mental model:
- submit a prediction job
- poll status
- retrieve result + scores + artifacts
Here’s the fastest way to feel the pipeline:
```bash
curl -s -X POST https://YOUR_DOMAIN/api/v1/predict \
  -H "Content-Type: application/json" \
  -d '{"sequence":"MKFLK...","method":"alphafold3","irf_7d_enabled":true}' | jq
# -> {"job_id":"job_01HXYZ...","status":"queued"}

curl -s https://YOUR_DOMAIN/api/v1/jobs/job_01HXYZ... | jq
# -> {"job_id":"job_01HXYZ...","status":"running"}

curl -s https://YOUR_DOMAIN/api/v1/jobs/job_01HXYZ.../result | jq
# -> {"final_decision":"REJECT","scores":{"irf7":0.73,"structure_s":0.18},
#     "signals":{"has_clash":true,"fraction_disordered":0.41,"ranking_score":-62.3}}
```
If you see REJECT with has_clash=true or high disorder, that’s the point:
semantic plausibility didn’t pass the laws of physics.
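If you prefer to drive the same job workflow from Python, a minimal polling client looks roughly like this (I'm using requests; the payload and field names mirror the curl example above, and error handling, auth, and timeouts are omitted):

```python
import time

import requests

BASE = "https://YOUR_DOMAIN/api/v1"


def predict_and_wait(sequence: str, poll_seconds: float = 5.0) -> dict:
    # 1) Submit the prediction job (same payload as the curl example).
    job = requests.post(f"{BASE}/predict", json={
        "sequence": sequence,
        "method": "alphafold3",
        "irf_7d_enabled": True,
    }).json()
    job_id = job["job_id"]

    # 2) Poll until the job leaves the queue (statuses taken from the example responses).
    while requests.get(f"{BASE}/jobs/{job_id}").json()["status"] in ("queued", "running"):
        time.sleep(poll_seconds)

    # 3) Retrieve the decision object: final_decision, scores, signals.
    return requests.get(f"{BASE}/jobs/{job_id}/result").json()


result = predict_and_wait("MKFLK...")  # placeholder sequence from the curl example
if result["final_decision"] == "REJECT":
    print("Physics said no:", result["signals"])
```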
Option B) Use the LOGOS service in Python (reasoning workflow)
```python
from src.logos.rexsyn_service import RExSynLOGOSService

logos = RExSynLOGOSService(
    enable_irf=True,
    enable_aats=True,
    enable_drift_control=True
)

structure_result = logos.reason_about_structure(
    query="Evaluate this prediction for drug discovery",
    scores={"dockq_score": 0.79, "saxs_chi2": 1.85, "druggability_score": 0.82},
    meta={"sequence": "MKFLK.", "target_class": "kinase", "method": "alphafold2"}
)

if structure_result["final_decision"] == "PASS":
    validation = logos.validate_hypothesis(
        hypothesis="This structure is suitable for structure-based drug design",
        evidence=structure_result,
        domain="biomedical"
    )
    print(validation["validation_status"], validation["consensus"])
```
The point: you’re not just getting a structure score — you’re getting a reasoned decision object.
What makes this “special” (not just another pipeline)
1) Structure is not decoration; it’s a reasoning axis
S is computed, normalized, and used for accept/reject decisions.
2) Schema is treated like governance
Strict types prevent “silent drift” and “slop” patterns.
3) Trust claims are tied to verification artifacts
Instead of “trust me,” we ship reproducibility hooks (schema parity + deterministic mocks + validation surfaces).
(If you hate symbolic scoring: same. The point isn’t the symbol. The point is: verification is a first-class shipping artifact.)
What I’m improving next (what to watch)
- Adaptive weighting: replace heuristic constants with calibration from PoseBusters/DockQ/SAXS feedback loops
- Cascade reasoning: early-exit 7D when 6D already fails (see the sketch after this list)
- Multi-chain interface focus: score binding-interface regions, not only global matrices
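For the cascade item above, the shape I have in mind is roughly this (function names and thresholds are placeholders, not the shipped API):

```python
def evaluate_hypothesis(hypothesis, six_d_check, structure_check, structure_floor=0.4):
    """
    Cascade sketch: run the cheap 6D coherence pass first and only pay for
    structural inference (AF3 -> S) when the hypothesis is still alive.
    """
    six_d = six_d_check(hypothesis)  # cheap: semantic / logical lenses
    if six_d["min_score"] < six_d["floor"]:
        return {"final_decision": "REJECT", "reason": "fails 6D coherence", "scores": six_d}

    s = structure_check(hypothesis)  # expensive: structural confidence -> S
    if s < structure_floor:
        return {"final_decision": "REJECT", "reason": "structure veto", "structure_s": s}

    return {"final_decision": "PASS", "structure_s": s, "scores": six_d}
```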
Closing
Most systems can generate biomedical text.
Few systems can say:
“This is coherent, but physically invalid — here’s why.”
RExSyn Nexus v0.6.1 is my attempt to build that kind of refusal — auditable, deterministic, and grounded.





