Medical AI Doesn’t Just Need Bigger Models. It Needs an ImageNet for State Transitions

Whoever builds the “state–intervention–transition” dataset for biomedicine may define the next generation of medical AI infrastructure.

Author: Jianghui Xiong


Medical AI is moving beyond classification, risk prediction, and question answering.

The next frontier is not just:


sample → label


or:


question → answer


It is:


state + action → next state


In other words:


current biological state + intervention → future biological state


To build real biomedical world models, we need more than bigger models. We need something analogous to ImageNet — not for images, but for biological state transitions.

I will call this idea, for now:


Biomedical TransitionNet


A shared infrastructure for recording, standardizing, and evaluating:


baseline biological state
  + intervention
  + follow-up biological state
  + mechanism evidence
  + uncertainty

This article explains why such an infrastructure is needed, why it matters, and why it is scientifically difficult.

It does not claim that a complete biomedical world model already exists.


1. ImageNet was not just a dataset. It was infrastructure.

When people talk about the deep learning revolution in computer vision, they often mention AlexNet, VGG, ResNet, and other neural network architectures.

That is correct, but incomplete.

One of the most important enabling factors was ImageNet.

ImageNet was not merely a large collection of images. Its deeper value was that it gave computer vision a shared coordinate system:

  • a common task,
  • a common label hierarchy,
  • common training and test data,
  • common benchmarks,
  • and a way to compare progress across models and institutions.

Before ImageNet, many computer vision systems were difficult to compare because they were trained and evaluated on different datasets. ImageNet helped the field converge around shared evaluation.

That is why ImageNet became much more than a database. It became research infrastructure.

Medical AI may now need something similar.

But not an image dataset.

Medicine needs an ImageNet for state transitions.


2. Medical AI has many models, but not enough transition data

Today, we already have many types of medical AI systems:

  • medical large language models,
  • medical question-answering systems,
  • radiology models,
  • pathology models,
  • omics foundation models,
  • virtual cell models,
  • digital twin systems,
  • clinical decision support tools,
  • AI drug discovery platforms.

These are important.

But if we think the future of medical AI is only “a bigger medical chatbot”, we may miss the real challenge.

Medicine is not only about answering questions.

Medicine is about understanding and changing biological trajectories.

A clinician does not only ask:


What disease does this patient have?


They also ask:


Why is this biological state happening?

What is driving deterioration?

Which mechanisms are actionable?

Which intervention may shift the trajectory?

How should the response be measured?

What if the expected response does not happen?

What if an adverse response appears?


These are not just language problems.

They are state transition problems.

Most medical AI today is still closer to:


sample → label


or:


question → answer


But biomedical world models require something closer to:


state + action → next state


That is the key shift.


3. What is a biomedical world model?

In AI, a world model is usually understood as an internal model that helps an agent simulate how the environment changes after an action.

A simple abstraction is:


current state + action → future state


In robotics, this may mean:


robot pose + motor command → next scene state


In autonomous driving, it may mean:


traffic scene + driving action → future traffic scene


In biomedicine, the analogous formulation would be:


biological state + intervention → future biological state


This could apply at multiple scales:


cell state + perturbation → cellular response

tissue state + treatment → tissue response

patient state + intervention → follow-up state


A biomedical world model should therefore not be understood as a medical chatbot.

It is not merely:


medical text in → medical text out


A more meaningful biomedical world model would combine:


state representation
  + intervention representation
  + transition modeling
  + mechanism evidence
  + uncertainty estimation
  + feedback correction

That is much harder than ordinary medical QA.

And it requires a different kind of data.


4. Why medicine needs its own ImageNet

In computer vision, a basic supervised learning unit can often be simplified as:


image + label


For biomedical world models, the basic unit should look more like:


baseline state + action + follow-up state


Or mathematically:


S(t) + A → S(t + Δt)


Where:


S(t)       = biological state before intervention
A          = action or intervention
S(t + Δt)  = biological state after intervention
Δt         = time interval

This is fundamentally different from a static medical database.

A biomedical world model does not only need:

  • medical images,
  • electronic health records,
  • omics profiles,
  • drug-target databases,
  • clinical notes,
  • literature graphs.

Those are useful, but insufficient.

It needs structured longitudinal data describing:


what the biological state was,

what action was taken,

what changed afterward,

over what time scale,

with what evidence,

and with what uncertainty.


This is why medicine needs something like a Biomedical TransitionNet.

Not a direct copy of ImageNet.

A new infrastructure designed for biological state transitions.
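
One way to read the unit S(t) + A → S(t + Δt) is as a function signature. The sketch below is only an illustration under my own assumptions: the names `State`, `Action`, `TransitionModel`, and `PersistenceBaseline` are hypothetical, and the biomarker keys and values are made up. It also shows a "no change" baseline, which is a natural floor that any learned transition model would have to beat.

```python
from dataclasses import dataclass
from typing import Dict, Protocol

# Hypothetical state encoding: biomarker name -> measured value.
State = Dict[str, float]


@dataclass(frozen=True)
class Action:
    """A hypothetical, heavily simplified intervention descriptor."""
    name: str
    dose: float = 0.0


class TransitionModel(Protocol):
    """Anything that maps (S(t), A, Δt) to a predicted S(t + Δt)."""

    def predict(self, state: State, action: Action, dt_days: float) -> State:
        ...


class PersistenceBaseline:
    """Predicts S(t + Δt) = S(t), ignoring the action entirely."""

    def predict(self, state: State, action: Action, dt_days: float) -> State:
        return dict(state)  # copy, so callers cannot mutate the input


# Illustrative values only; not real patient data.
baseline_state = {"crp_mg_per_l": 4.2, "hba1c_percent": 6.1}
model: TransitionModel = PersistenceBaseline()
predicted = model.predict(baseline_state, Action("exercise_program"), dt_days=90)
```

A static database can fill in `State`. Only transition data can tell us whether a model does better than `PersistenceBaseline`.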


5. What should one data unit look like?

A conventional supervised learning sample may look like:


x → y


Examples:


image → diagnosis label

clinical note → ICD code

genomic variant → risk category


A biomedical world-model sample should look more like:


state_before
  + intervention
  + state_after
  + time_interval
  + evidence_chain
  + uncertainty

A simplified schema might look like this:


{
  "baseline_state": {
    "molecular": "...",
    "clinical": "...",
    "phenotype": "...",
    "lifestyle": "...",
    "context": "..."
  },
  "action": {
    "type": "...",
    "dose": "...",
    "frequency": "...",
    "duration": "...",
    "mechanism": "..."
  },
  "follow_up_state": {
    "molecular": "...",
    "clinical": "...",
    "phenotype": "...",
    "adverse_events": "..."
  },
  "transition": {
    "direction": "...",
    "magnitude": "...",
    "time_scale": "...",
    "confidence": "..."
  },
  "evidence_chain": {
    "target": "...",
    "pathway": "...",
    "biomarker": "...",
    "phenotype": "...",
    "validation": "..."
  }
}

This is obviously simplified.

But the principle matters:

A biomedical world model should learn not only:


what this sample is


but:


how this biological system changed after a defined intervention
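
To make the simplified schema a little more concrete, here is a minimal Python sketch of the same record. The top-level field names follow the schema above. The class name `TransitionRecord`, the helper functions, the list of required action keys, and all example values are my own assumptions for illustration, not a proposed standard.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List

# Assumption: which action parameters must be present for a usable record.
REQUIRED_ACTION_KEYS = ("type", "dose", "frequency", "duration")


@dataclass
class TransitionRecord:
    baseline_state: Dict[str, str]
    action: Dict[str, str]
    follow_up_state: Dict[str, str]
    transition: Dict[str, str]
    evidence_chain: Dict[str, str] = field(default_factory=dict)


def missing_action_fields(record: TransitionRecord) -> List[str]:
    """List required action parameters that are absent or empty."""
    return [key for key in REQUIRED_ACTION_KEYS if not record.action.get(key)]


def is_transition(record: TransitionRecord) -> bool:
    """A record is a transition only if both states were actually measured."""
    return bool(record.baseline_state) and bool(record.follow_up_state)


# Hypothetical example values.
record = TransitionRecord(
    baseline_state={"clinical": "hypothetical baseline summary"},
    action={"type": "exercise", "dose": "30 min", "frequency": "3 per week"},
    follow_up_state={"clinical": "hypothetical follow-up summary"},
    transition={"direction": "improved", "time_scale": "12 weeks"},
)
gaps = missing_action_fields(record)  # the duration was never recorded
payload = json.dumps(asdict(record))  # serializes to the JSON layout above
```

Even this toy version makes one point visible: a record without a follow-up state, or with an under-specified action, is not a usable training sample.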


6. Five layers of a biomedical ImageNet

If we want to build an ImageNet-like infrastructure for biomedical world models, it should include at least five layers.


6.1 State representation

The first question is:


What is the biological state?


A patient state is not just a diagnosis label.

Terms such as:


diabetes

hypertension

aging

inflammation

fatigue

frailty


are useful, but they are high-level descriptions.

A real biological state may include:

  • genome,
  • DNA methylation,
  • transcriptome,
  • proteome,
  • metabolome,
  • immune state,
  • inflammatory state,
  • organ function,
  • microbiome,
  • sleep,
  • activity,
  • diet,
  • medication history,
  • environmental exposure,
  • clinical background.

A simplified representation may be:


individual_state =
    molecular_state
  + pathway_state
  + organ_state
  + phenotype_state
  + lifestyle_context
  + clinical_context

Without a state representation, a biomedical world model does not know what it is simulating.


6.2 Action ontology

A world model needs actions.

In medicine, actions are complex.

They may include:

  • drugs,
  • supplements,
  • diet,
  • exercise,
  • sleep intervention,
  • stress management,
  • cell therapy,
  • gene therapy,
  • regenerative medicine,
  • combination therapy,
  • N-of-1 personalized intervention.

Even a drug intervention requires many parameters:


drug name

dose

frequency

route

duration

combination

adherence

contraindications

adverse events


Exercise intervention also requires:


type

intensity

frequency

duration

heart-rate zone

recovery condition

baseline fitness


If actions are not standardized, the model cannot learn meaningful transitions.
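
A small sketch of what action standardization could mean in practice: two differently written descriptions of the same regimen should map to the same encoded action. Everything here is hypothetical, including the `DrugAction` class, the tiny frequency synonym table, and the placeholder drug name. A real action ontology would be far larger and expert-curated.

```python
from dataclasses import dataclass

# Hypothetical synonym table mapping frequency terms to administrations per day.
FREQUENCY_PER_DAY = {
    "qd": 1.0,
    "once daily": 1.0,
    "bid": 2.0,
    "twice daily": 2.0,
    "tid": 3.0,
}


@dataclass(frozen=True)
class DrugAction:
    drug: str
    dose_mg: float
    per_day: float
    route: str
    duration_days: int

    @property
    def daily_dose_mg(self) -> float:
        return self.dose_mg * self.per_day


def normalize_drug_action(drug, dose_mg, frequency, route, duration_days) -> DrugAction:
    """Map free-text fields onto one canonical encoding, or fail loudly."""
    key = frequency.strip().lower()
    if key not in FREQUENCY_PER_DAY:
        raise ValueError(f"unknown frequency term: {frequency!r}")
    return DrugAction(
        drug=drug.strip().lower(),
        dose_mg=float(dose_mg),
        per_day=FREQUENCY_PER_DAY[key],
        route=route.strip().lower(),
        duration_days=int(duration_days),
    )


# "DrugX" is a placeholder, not a real dosing example.
a = normalize_drug_action("DrugX", 500, "BID", "oral", 90)
b = normalize_drug_action("drugx ", 500, "twice daily", "Oral", 90)
same_action = a == b  # both spellings collapse to one encoded action
```

Raising an error on unknown terms is a deliberate choice in this sketch: silently guessing an action parameter would add exactly the kind of noise that makes transitions unlearnable.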


6.3 Transition record

The core of a biomedical world model is the transition:


before → after


Examples:


inflammatory state before intervention → inflammatory state after intervention

DNA methylation age before intervention → DNA methylation age after intervention

metabolic state before intervention → metabolic state after intervention

tumor state before treatment → tumor state after treatment


Without follow-up measurement, there is no transition.

Without transition, there is no world model.

Many medical datasets still offer only:


one-time measurement


Biomedical world models need:


longitudinal measurement
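
The difference between the two can be shown with a short sketch that turns repeated visits into before → after samples. The `Visit` class, the `build_transitions` helper, and the measurement values are assumptions made for illustration. Note that a single visit produces no transition at all.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Visit:
    day: date
    measurements: Dict[str, float]
    # Hypothetical field: the intervention started at this visit, if any.
    intervention_started: Optional[str] = None


def build_transitions(visits: List[Visit]) -> List[dict]:
    """Pair each visit with the next one in time to form transition samples."""
    ordered = sorted(visits, key=lambda v: v.day)
    samples = []
    for before, after in zip(ordered, ordered[1:]):
        # Only biomarkers measured at both time points can form a delta.
        shared = before.measurements.keys() & after.measurements.keys()
        samples.append({
            "action": before.intervention_started,
            "dt_days": (after.day - before.day).days,
            "delta": {k: after.measurements[k] - before.measurements[k]
                      for k in sorted(shared)},
        })
    return samples


# Illustrative values only.
visits = [
    Visit(date(2024, 1, 10), {"crp": 5.0, "glucose": 100.0}, "diet_change"),
    Visit(date(2024, 4, 9), {"crp": 3.0}),
]
transitions = build_transitions(visits)
```

Here glucose drops out of the sample because it was not re-measured. That is the practical cost of incomplete follow-up.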


6.4 Evidence chain

A medical model should not only output a probability.

If a model says only:


This intervention may help.

that is not enough.

It should also answer:


Which targets are involved?

Which pathways are affected?

Which abnormal state does this address?

Which biomarkers can validate the response?

Which evidence comes from experiments?

Which evidence comes from clinical data?

Which part is only model inference?

Which risks should be monitored?


In medicine, prediction alone is not sufficient.

A safer output should look more like:


prediction + mechanism + validation + uncertainty


This is especially important because medical AI should not become an uninspectable black box.


6.5 Benchmark task

ImageNet helped computer vision because different models could be compared on shared tasks.

Biomedical world models need benchmarks too.

Possible benchmark tasks include:

  • cellular perturbation response prediction,
  • gene expression response after drug perturbation,
  • tumor state simulation after treatment,
  • metabolic biomarker response prediction,
  • inflammatory state transition prediction,
  • aging-related biomarker transition prediction,
  • N-of-1 intervention response direction prediction.

But the metrics cannot be copied directly from image classification.

Useful metrics may include:


directional accuracy

mechanistic consistency

biomarker validation

uncertainty calibration

risk awareness

cross-context generalization


This is much harder than top-1 accuracy.

But medicine requires it.
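
Two of these metrics are easy to sketch. Directional accuracy asks whether the predicted change has the same sign as the observed change. For uncertainty, I use the Brier score here as one common scoring rule that penalizes miscalibrated probabilities. It is my own stand-in, not the only or necessarily the best calibration measure. The function names and all numbers below are illustrative assumptions.

```python
from typing import Sequence


def _sign(x: float, tol: float) -> int:
    """Return -1, 0, or +1; changes within ±tol count as 'no change'."""
    if x > tol:
        return 1
    if x < -tol:
        return -1
    return 0


def directional_accuracy(predicted: Sequence[float],
                         observed: Sequence[float],
                         tol: float = 0.0) -> float:
    """Fraction of samples where predicted and observed change agree in sign."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(_sign(p, tol) == _sign(o, tol) for p, o in zip(predicted, observed))
    return hits / len(predicted)


def brier_score(prob_improve: Sequence[float], improved: Sequence[bool]) -> float:
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    if len(prob_improve) != len(improved) or len(prob_improve) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - float(y)) ** 2 for p, y in zip(prob_improve, improved))
    return total / len(prob_improve)


# Made-up predicted and observed biomarker changes for four samples.
predicted_delta = [-1.0, 0.5, -0.2, 0.3]
observed_delta = [-0.4, 0.1, 0.6, 0.2]
acc = directional_accuracy(predicted_delta, observed_delta)  # 3 of 4 signs agree

brier = brier_score([0.9, 0.2, 0.6, 0.5], [True, False, True, False])
```

Metrics such as mechanistic consistency or risk awareness have no comparably simple formula, which is part of why benchmark design is hard.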


7. Related progress: promising, but still early

To be scientifically careful, we should not pretend that complete biomedical world models already exist.

They do not.

But several related directions are emerging.


7.1 ImageNet as an infrastructure analogy

ImageNet and ILSVRC showed how large-scale, standardized datasets and benchmarks can accelerate a field.

However, ImageNet is a benchmark for image classification and detection.

It is not equivalent to what biomedicine needs.

Here, ImageNet is used only as an infrastructure analogy.

The biomedical version must be longitudinal, dynamic, intervention-aware, and mechanism-sensitive.


7.2 World Models in AI

Ha and Schmidhuber’s World Models is a representative work in AI world modeling.

Its key idea is that an agent can learn an internal model of the environment and use it to simulate future states.

Medicine cannot directly copy this setting.

A human body is not a game environment.

Clinical intervention cannot be freely explored by trial and error.

But the abstraction:


state + action → future state


is still useful for thinking about medical AI.


7.3 Virtual cells and perturbation response

Arc Institute’s State model is a recent example of virtual-cell modeling.

It aims to predict how cells respond to drugs, cytokines, or genetic perturbations. Public descriptions indicate that State was trained on large-scale observational and perturbational single-cell data.

This is important because it directly touches the pattern:


cell state + perturbation → cellular response


However, State is primarily a cellular-level model.

It should not be confused with a complete patient-level biomedical world model.


7.4 Medical World Model for tumor evolution

Recent work using the term Medical World Model, such as MeWM, explores generative simulation of tumor evolution under treatment conditions.

This is relevant because it moves medical AI from static recognition toward treatment-conditioned disease dynamics.

But this direction is still early.

It should not be interpreted as a general solution to biomedical world modeling.


7.5 Digital twins and virtual physiological systems

Long before today’s AI world-model terminology, fields such as computational physiology, systems biology, virtual physiological systems, and digital twins already tried to connect biological structure, mechanism, dynamics, and measurable outputs.

That tradition matters.

A good biomedical world model should not be just a black-box predictor.

It should connect:


state

mechanism

dynamic change

measurement

feedback


Today’s biomedical world models can be seen as an extension of this older systems-modeling tradition into the era of AI, multi-omics, real-world data, and large-scale computation.


8. Why steerability matters

A biomedical world model that only predicts is not enough.

A model may predict that a patient’s risk is increasing.

But medicine needs more than that.

It needs to ask:


Which state can be measured?

Which abnormality can be explained?

Which intervention can be described?

Which transition can be tested?

Which deviation can be traced?

Which failure can be corrected?


This is why I emphasize steerability.

Going forward, I will use the name:


SteeraMed: A Steerable Biomedical World Model


Website:


https://SteeraMed.com


The earlier preprint name was:


SEWO / Steerable Medicine World Model


or in Chinese:


可驾驭医学世界模型


Whenever I mention SEWO / 可驾驭医学世界模型, it should be understood together with the new unified naming:


SteeraMed: A Steerable Biomedical World Model


The idea behind SEWO / SteeraMed is that biomedical world models should not only pursue predictive accuracy. They should also support:

  • state definition,
  • intervention description,
  • transition hypothesis,
  • mechanism audit,
  • deviation tracing,
  • uncertainty inspection,
  • expert steering,
  • and iterative correction.

The related ideas were introduced in the preprint:


World Models for Biomedicine: A Steerability Framework


and are also presented at:


https://steerable.world


Important clarification:

SEWO / SteeraMed is not a clinically validated treatment system.

It is not a medical device.

It is better understood as a structural framework and evidence-chain design principle for future biomedical world models.

The key question is not only:


Can the model predict?


but:


Can researchers and clinicians inspect, question, correct, and steer the model within clearly defined boundaries?


9. Why longevity medicine may be one entry point

Biomedical world models could start from many areas:

  • oncology,
  • cardiovascular disease,
  • metabolic disease,
  • immunology,
  • neurodegeneration,
  • drug discovery,
  • virtual cells,
  • longevity medicine.

Longevity medicine is not the only entry point.

But it is an interesting one.

Why?


9.1 Aging is a continuous state

Aging is not a single disease label.

It is a continuous, multi-system biological process involving:

  • inflammation,
  • metabolism,
  • immunity,
  • epigenetics,
  • mitochondrial function,
  • proteostasis,
  • stem-cell exhaustion,
  • cellular senescence,
  • organ function decline.

That makes it naturally suitable for state modeling.


9.2 Longevity medicine requires repeated measurement

Longevity medicine is not a one-time diagnostic event.

It depends on repeated measurement over time.

A useful intervention must be evaluated through:


baseline state → intervention → follow-up state


This is exactly the structure needed for biomedical world modeling.


9.3 Interventions are diverse

Longevity-related interventions may include:

  • diet,
  • exercise,
  • sleep,
  • supplements,
  • drugs,
  • cell therapy,
  • regenerative medicine,
  • stress management,
  • environmental exposure management.

This provides a rich action space.


9.4 Individual responses vary

The same intervention may produce different responses in different people.

That means longevity medicine cannot rely only on average effects.

It needs N-of-1 style transition data:


individual state → intervention → individual transition


Each well-structured N-of-1 intervention can be seen as a small world-model experiment.
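
A toy example of why average effects are not enough. The numbers are invented, and I assume a biomarker where a decrease counts as improvement.

```python
from statistics import mean

# Hypothetical biomarker changes (follow-up minus baseline) for four people
# who received the same intervention. Lower is assumed to be better.
individual_deltas = {"p1": -2.0, "p2": -1.5, "p3": 1.5, "p4": 2.0}

average_effect = mean(list(individual_deltas.values()))

# Individuals whose biomarker actually moved in the desired direction.
responders = sorted(pid for pid, delta in individual_deltas.items() if delta < 0)
non_responders = sorted(pid for pid, delta in individual_deltas.items() if delta >= 0)
```

In this made-up cohort the average change is zero, yet two people improved and two worsened. Only individual-level transition records keep that information.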


10. Engineering implications

From an engineering perspective, the biomedical ImageNet is not just a dataset.

It is a data infrastructure problem.

It requires:

  • data collection,
  • data standardization,
  • multimodal integration,
  • time-series modeling,
  • intervention encoding,
  • causal confounding control,
  • privacy protection,
  • benchmark design,
  • safety boundaries,
  • evidence-chain tracking.

A simplified loop may look like:


measure state
↓
standardize state representation
↓
record intervention
↓
measure follow-up state
↓
construct transition sample
↓
train / evaluate world model
↓
generate testable hypothesis
↓
repeat and correct

This is not a static dataset.

It is a data flywheel.


11. Main challenges

This is scientifically and technically difficult.

Some of the main challenges include:


11.1 Biological state is complex

A human state cannot be compressed into one label.

We need ways to represent multi-omics, clinical metrics, imaging, lifestyle, symptoms, environmental exposure, and medical history as computable state variables.


11.2 Interventions are hard to standardize

Drugs, exercise, diet, sleep, supplements, and cell therapies all have complex parameters.

Without action standardization, transition learning will be noisy.


11.3 Follow-up data is scarce

Most medical data is not collected as structured pre/post intervention transition data.

This requires new data collection workflows.


11.4 Causal confounding is serious

In the real world, people often change many things at once:


diet

exercise

sleep

medication

supplements

stress


Attributing a state change to one factor is difficult.

This requires careful study design and statistical methods.
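
A deliberately artificial example shows the problem. In the invented data below, exercise lowers a biomarker by 2 units and the supplement does nothing, but people who take the supplement are also more likely to exercise. Stratifying by the confounder is used here only as the simplest possible adjustment. It assumes the confounder is known and measured, which real-world data often cannot guarantee.

```python
from statistics import mean

# Each row: (took_supplement, exercised, biomarker_change). Invented data.
rows = (
    [(1, 1, -2.0)] * 3   # supplement + exercise
    + [(1, 0, 0.0)]      # supplement only
    + [(0, 1, -2.0)]     # exercise only
    + [(0, 0, 0.0)] * 3  # neither
)


def supplement_effect(subset):
    """Mean change with supplement minus mean change without it."""
    treated = [delta for supp, _, delta in subset if supp == 1]
    control = [delta for supp, _, delta in subset if supp == 0]
    return mean(treated) - mean(control)


# Naive comparison across everyone mixes in the effect of exercise.
naive = supplement_effect(rows)

# Comparing within each exercise stratum removes that mixing.
stratified = {
    exercised: supplement_effect([r for r in rows if r[1] == exercised])
    for exercised in (0, 1)
}
```

The naive estimate credits the supplement with a one-unit improvement, while the within-stratum estimates are both zero. A transition dataset that records only one of several simultaneous changes would teach a model the naive answer.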


11.5 Safety and ethics are central

A biomedical world model cannot freely experiment like a game-playing agent.

Any intervention-related model must clearly distinguish:


research hypothesis

health-management suggestion

clinical decision support

medical recommendation

validated therapy


Clinical use would require prospective validation, safety evaluation, ethical review, regulatory review where applicable, and professional oversight.


11.6 Open standards and business incentives may conflict

If everything is closed, the field cannot build shared benchmarks.

If everything is open, companies may lack incentives to invest.

A practical ecosystem will need a balance among:


open benchmarks

privacy protection

commercial incentives

scientific collaboration


12. A minimal viable direction

A biomedical ImageNet should not begin by trying to simulate the entire human body.

A more realistic path is to start with minimal viable tasks.

Examples:

  • cellular perturbation response prediction,
  • tumor state change after treatment,
  • metabolic biomarker response prediction,
  • inflammatory state transition prediction,
  • DNA methylation age transition,
  • N-of-1 longevity intervention tracking.

A minimal task should define:


1. state variables
2. intervention variables
3. follow-up time
4. transition metrics
5. benchmark task
6. safety boundary


Start narrow.

Make it measurable.

Make it repeatable.

Make it auditable.

Then scale.
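
The six items a minimal task should define can be written down as a small checklist object. This is a sketch under my own assumptions: the class name `MinimalTask`, the field types, and the example values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class MinimalTask:
    state_variables: List[str]
    intervention_variables: List[str]
    follow_up_days: int
    transition_metrics: List[str]
    benchmark_task: str
    safety_boundary: str

    def missing(self) -> List[str]:
        """Names of the items that are still empty or zero."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical draft of a narrow task with the safety boundary not yet written.
draft = MinimalTask(
    state_variables=["crp", "il6"],
    intervention_variables=["exercise_minutes_per_week"],
    follow_up_days=90,
    transition_metrics=["directional_accuracy"],
    benchmark_task="inflammatory state transition prediction",
    safety_boundary="",
)
open_items = draft.missing()
```

A task definition that still has open items is not ready to become a benchmark.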


13. Whoever defines state, action, and transition may define the field

Medical AI will still need better models.

But bigger models alone cannot solve the problem of biomedical state transition learning.

The scarce asset is the infrastructure that allows models to learn:


how life systems change after intervention


Future platform-level medical AI companies may not be the ones with the largest language models.

They may be the ones that can build the strongest data flywheel:


measure biological state

standardize interventions

record follow-up changes

construct mechanism evidence chains

evaluate transition models

repeat


Whoever defines state defines what medical AI can see.

Whoever defines action defines how medical AI understands intervention.

Whoever defines transition defines how medical AI learns biological change.

Whoever defines the benchmark defines how the field measures progress.


Conclusion

ImageNet helped machines learn to see the world.

A biomedical ImageNet should help AI learn how life responds to intervention.

That does not mean replacing clinicians.

It means building a scientific infrastructure where models can learn:


how states form

how interventions act

how systems transition

how evidence is validated


The next decade of medical AI may not be limited by model size alone.

It may be limited by the lack of a shared infrastructure for biological state transitions.

That is the real opportunity.


References

  1. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L.

    ImageNet: A Large-Scale Hierarchical Image Database. CVPR. 2009.

    https://ieeexplore.ieee.org/document/5206848

  2. Russakovsky O, Deng J, Su H, et al.

    ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision. 2015.

    https://arxiv.org/abs/1409.0575

  3. ImageNet official website.

    https://www.image-net.org/

  4. Ha D, Schmidhuber J.

    World Models. 2018.

    https://worldmodels.github.io/

  5. Arc Institute.

    Arc Institute’s first virtual cell model: State.

    https://arcinstitute.org/news/virtual-cell-model-state

  6. Predicting cellular responses to perturbation across diverse contexts with State. bioRxiv. 2025.

    https://www.biorxiv.org/content/10.1101/2025.06.26.661135v1

  7. Yang Y, Wang ZY, Liu Q, et al.

    Medical World Model: Generative Simulation of Tumor Evolution for Treatment Planning. arXiv.

    https://arxiv.org/abs/2506.02327

  8. IEEE Transactions on Biomedical Engineering.

    Digital Twins / AI World Models.

    https://www.embs.org/tbme/research-highlights/digital-twins-ai-world-models/

  9. Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ.

    Multimodal biomedical AI. Nature Medicine. 2022.

    https://www.nature.com/articles/s41591-022-01981-2

  10. Xiong J.

    World Models for Biomedicine: A Steerability Framework. Preprints.org. 2026.

    https://www.preprints.org/manuscript/202605.0366

    DOI: https://doi.org/10.20944/preprints202605.0366.v1

  11. SteeraMed: A Steerable Biomedical World Model.

    https://steerable.world


Disclaimer

This article is for research, technical, and industry discussion only.

It is not medical advice, diagnostic advice, or treatment advice.

Any biomedical world model intended for clinical use would require prospective validation, safety evaluation, ethical review, regulatory review where applicable, and professional clinical oversight.

