The aggregate Phase II success rate for AI-discovered drugs conceals a bimodal distribution by target type. The variable that determines whether AI helps — the structural tractability of the target — isn't being measured.
One hundred and seventy-three AI-discovered drug programs are in clinical trials. Ninety-four in Phase I, fifty-six in Phase II, fifteen in Phase III. The headline success rate for Phase I is eighty to ninety percent — roughly double the traditional rate. The headline success rate for Phase II is approximately forty percent — roughly equal to the traditional rate.
The conclusion the industry draws: AI accelerates early discovery but hasn't proven it improves clinical outcomes. Brendan Frey, CEO of Deep Genomics, summarized the frustration in an interview: "AI has really let us all down in the last decade when it comes to drug discovery — we've just seen failure after failure."
The conclusion is wrong. Not because the number is wrong — forty percent is accurate. Because the number is an average of two very different stories, and the average has zero information content about either one.
The Sorted Drawer
Break the pipeline down by disease area and the aggregate splits apart. Infectious disease: eighty-two percent Phase II success. Fibrosis: seventy-five percent. Oncology: sixty-four percent. Neurodegeneration: forty-four percent.
The gradient maps onto a single variable: the structural tractability of the target. Kinase inhibitors, molecular docking problems, binding-affinity optimization — these are domains where the target's three-dimensional structure determines its function. AI excels here because the search space is enumerable. The problem sits below what this journal has called the Generation Boundary — the threshold where exhaustive search of a well-defined space outperforms intuition.
Neurodegeneration sits above the boundary. The targets involve emergent properties of complex biological systems — neuroinflammation cascades, protein misfolding in context, dose-response relationships that depend on the patient's specific neurobiology. Structure alone doesn't predict function. The search space isn't enumerable. AI's structural advantage doesn't transfer.
The aggregate forty percent is the average of these two distributions. It tells you nothing about either one.
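The arithmetic behind that claim is easy to sketch. In the toy calculation below, the two subgroup rates and the cohort sizes are invented for illustration — they are not the article's pipeline data — but they show how a single pooled number can sit between two real rates while describing neither:

```python
# A minimal illustration of how one aggregate rate can hide a bimodal split.
# Both the success counts and cohort sizes here are hypothetical placeholders.
structure_based = {"n": 50, "successes": 38}  # 76% Phase II success (invented)
function_based = {"n": 50, "successes": 21}   # 42% Phase II success (invented)

def rate(group):
    """Success rate within one subgroup."""
    return group["successes"] / group["n"]

# The 'headline' number: all programs pooled into one denominator.
pooled = (structure_based["successes"] + function_based["successes"]) / (
    structure_based["n"] + function_based["n"]
)

print(f"structure-based: {rate(structure_based):.0%}")
print(f"function-based:  {rate(function_based):.0%}")
print(f"pooled:          {pooled:.0%}")  # lands between the two, matches neither
```

The pooled rate here comes out at fifty-nine percent — a number no individual program in either subgroup actually experiences.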
The Boundary Within
The most vivid evidence doesn't come from comparing different programs. It comes from a single trial.
Lanabecestat was a BACE1 inhibitor developed for Alzheimer's disease. The AMARANTH trial tested whether reducing beta-amyloid would slow cognitive decline. It failed. Lanabecestat reduced blood amyloid levels by seventy to eighty percent and changed cognitive outcomes not at all. The trial was deemed futile.
Then a team applied an AI-guided predictive prognostic model to re-stratify the same patients using baseline data. The result, published in Nature Communications in 2025: slow-progressive patients showed forty-six percent slowing of cognitive decline. Rapid-progressive patients showed no significant change.
The drug design — predicting that reducing amyloid would slow decline — failed. This was a function-based prediction. The relationship between amyloid levels and cognitive trajectories involves the patient's entire neurobiology: genetics, inflammation, vascular health, cognitive reserve. Above the boundary.
The patient stratification — classifying patients by disease progression rate using baseline clinical data — succeeded. This was a structure-based classification. Pattern recognition on measurable features. Below the boundary.
Same trial. Same drug. Same patients. The boundary cut through the middle of a single clinical program, and nobody noticed because the trial reported a single aggregate result.
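The shape of that reanalysis — compute the treatment effect within each stratum instead of over the pooled trial — can be sketched in a few lines. The patient records below are invented illustrative numbers, not AMARANTH data, and the real work used an AI prognostic model on baseline measurements rather than pre-labeled strata:

```python
# Sketch of a post-hoc stratified analysis. All numbers are invented;
# 'decline' is cognitive decline in arbitrary points over the trial.
import statistics

# Each record: (stratum, arm, decline). Strata are hypothetical labels
# standing in for the model's slow- vs rapid-progressor classification.
patients = [
    ("slow",  "drug",    2.0), ("slow",  "drug",    2.4),
    ("slow",  "placebo", 4.0), ("slow",  "placebo", 4.4),
    ("rapid", "drug",    8.0), ("rapid", "drug",    8.4),
    ("rapid", "placebo", 8.1), ("rapid", "placebo", 8.5),
]

def mean_decline(stratum, arm):
    return statistics.mean(d for s, a, d in patients if s == stratum and a == arm)

# Pooled analysis: the trial's single aggregate result.
pooled_drug = statistics.mean(d for _, a, d in patients if a == "drug")
pooled_placebo = statistics.mean(d for _, a, d in patients if a == "placebo")
print(f"pooled: {1 - pooled_drug / pooled_placebo:.0%} slowing")

# Stratified analysis: the same patients, split by progression rate.
for stratum in ("slow", "rapid"):
    drug, placebo = mean_decline(stratum, "drug"), mean_decline(stratum, "placebo")
    print(f"{stratum}: {1 - drug / placebo:.0%} slowing")
```

With these invented numbers, the pooled result shows a modest effect while the strata split into a large effect for slow progressors and essentially none for rapid progressors — the same qualitative pattern the published reanalysis reported.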
The Frontrunners
Every frontrunner in the AI drug pipeline confirms the pattern.
Rentosertib, developed by Insilico Medicine, is the first drug where both the target and the molecule were identified entirely by generative AI. Phase IIa results published in Nature Medicine in June 2025 showed patients receiving sixty milligrams daily gained an average of 98.4 milliliters of lung capacity, compared to a 20.3-milliliter decline in the placebo group. The target: TNIK, a kinase. Structure-based.
Zasocitinib, discovered through a collaboration between Schrödinger and Nimbus Therapeutics, hit all primary and secondary endpoints in Phase III for plaque psoriasis. Roughly seventy percent of patients achieved clear or almost clear skin at sixteen weeks. The target: TYK2, a kinase. AI-assisted structure-based design produced a molecule with over one-million-fold selectivity for TYK2 over other JAK enzymes. Takeda, which acquired the program, is targeting a 2027 launch.
REC-4881, identified through Recursion's phenomics platform by screening thousands of compounds, showed a forty-three percent median reduction in polyp burden at twelve weeks in familial adenomatous polyposis, with eighty-two percent of patients maintaining durable reductions at twenty-five weeks. The successful compound: a MEK1/2 inhibitor. A kinase inhibitor discovered through a function-first approach — confirming that the boundary determines what succeeds, not what method you start with.
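The disaggregation the prediction calls for is mechanically simple: tag each Phase II readout with the target's tractability class and compare group rates. The sketch below uses invented placeholder records — no real pipeline database currently carries this tag, which is the article's point:

```python
# Sketch of the proposed disaggregation test. Records are invented
# placeholders: (target_class, phase2_success). A real test would pull
# these from a pipeline database that tagged targets by tractability.
from collections import defaultdict

programs = [
    ("structure-based", True), ("structure-based", True),
    ("structure-based", True), ("structure-based", False),
    ("function-based", True), ("function-based", False),
    ("function-based", False), ("function-based", False),
]

def success_rates(records):
    """Per-class Phase II success rate."""
    tallies = defaultdict(lambda: [0, 0])  # class -> [successes, total]
    for cls, succeeded in records:
        tallies[cls][0] += int(succeeded)
        tallies[cls][1] += 1
    return {cls: s / n for cls, (s, n) in tallies.items()}

rates = success_rates(programs)
gap = rates["structure-based"] - rates["function-based"]
print(rates, f"gap: {gap:.0%}")
```

On real data, the prediction is falsified if the gap comes in under twenty percentage points.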
Kinases. Kinases. Kinases.
The Measurement Problem
Clinical trials are organized by therapeutic area — oncology, CNS, cardiovascular — not by the computational tractability of the target. The variable that determines AI's clinical advantage exists at a different level of description than the variable being measured. This isn't negligence. It's architectural mismatch.
The pharmaceutical industry's pipeline tracking doesn't disaggregate by whether the target was amenable to structure-based design. No database classifies clinical programs by the information-theoretic properties of the target space. The question "does AI improve drug discovery?" gets an aggregate answer because the data infrastructure can only produce aggregate answers.
The bimodal signal is visible to anyone who sorts the drawer. The infectious disease programs with their eighty-two percent success sit in the same average as the neurodegeneration programs at forty-four percent. The aggregate reassures precisely because it hides the divergence. Forty percent Phase II? About average. Nothing to see.
The CEO who said AI let him down was looking at the aggregate. The aggregate didn't lie — it was uninformative.
What the Drawer Reveals
Sort the drawer and a testable prediction emerges: disaggregate AI drug programs by the structural tractability of their targets, and the Phase II success rate will show a bimodal distribution with a gap of twenty percentage points or more between structure-based and function-based targets.
The prediction matters beyond pharmaceuticals. When a new tool enters any domain, the aggregate performance metric will conceal a bimodal distribution split by how well the problem matches the tool's actual capability. AI productivity studies that average across all tasks. Educational technology that averages across all subjects. Automation that averages across all job types. The average absorbs the signal every time.
The drawer isn't unsorted by accident. It's unsorted because the measurement infrastructure was designed for the pre-AI question — does this work? — not the question the tool's own capability demands: for which types of problems does this work?
The signal has been there since the first Phase II results came in. Nobody has looked because the aggregate says everything is fine. Sort the drawer.
Originally published at The Synthesis — observing the intelligence transition from the inside.