wei-ciao wu

Posted on Feb 26 • Originally published at loader.land

Millions of Flow Cytometry Datasets Are Useless for AI — Here's Why, and What It Would Take to Fix It

#flowcytometry #aiagents #machinelearning #datascience

The Search That Started This

I was looking for something specific: what happens when you try to feed flow cytometry data into a machine learning model trained on data from a different lab?

The answer, it turns out, is almost always: nothing useful.

But the reason is what makes this story worth telling. It's not that the algorithms don't work. It's that the data is a mess — and the mess runs so deep that NIST, the FDA, and NIAID had to convene a joint workshop just to begin addressing it.

The Workshop Nobody Expected

In June 2025, NIST co-organized a two-day virtual workshop with the FDA and NIAID titled "AI and Flow Cytometry" [1]. The participating institutions read like a who's who of cytometry: Stanford, Yale, University of Rochester, Oregon Health & Science University, BD Life Sciences, Revvity, Mayo Clinic.

The workshop's central finding was stark: millions of existing flow cytometry datasets are siloed and unsuitable for AI applications due to inconsistent quality and lack of standardization [2].

Let that sink in. Not thousands. Millions.

Flow cytometry generates roughly 30,000–50,000 new datasets per year across research and clinical labs worldwide. FlowRepository alone hosts hundreds of thousands of public FCS files. Add Cytobank, ImmPort, and institutional repositories, and you're looking at an enormous ocean of single-cell data — theoretically perfect for training AI models.

Except none of it talks to each other.

211,359 Files Tell the Story

The empirical foundation for this crisis was laid in 2020, when Bras and colleagues published a landmark analysis of 211,359 public FCS files scraped from Cytobank, FlowRepository, and ImmPort [3].

Their finding? The majority of FCS files are technically non-compliant with the FCS standard.

The biggest issue isn't exotic — it's mundane. Parameter naming conventions vary wildly across labs and instruments. The same marker (say, CD3) might be labeled as:

CD3
CD3-FITC
FITC-A
FL1-A
BV421-A :: CD3
<BV421-A>

For a human cytometrist, this is a minor nuisance. You glance at the panel sheet, mentally map the channels, and move on. You've been doing this for years.

For an AI model? It's a wall. The model has no idea that FL1-A and CD3-FITC refer to the same biological marker measured on different instruments.

And parameter naming is just the surface. Below that lies a cascade of inconsistencies:

Different antibody panels across labs studying the same disease
Different fluorophore combinations even when panels overlap
Different cytometer platforms (conventional vs. spectral, different manufacturers)
Different sample processing protocols (fresh vs. frozen, different staining times)
Non-standardized metadata for sample types, experimental conditions, disease states [4]

The irony is thick: existing FCS parsers like FlowCore and FlowJo handle most of these non-compliance issues gracefully — for human users. They've been patched and updated for decades to tolerate the mess. But "tolerating the mess" and "learning from the mess" are fundamentally different operations.

The Paradox: Good Enough for Humans, Broken for AI

This is the core insight that the NIST workshop identified, and I think it's the most important takeaway:

Flow cytometry data that "works" for human analysis systematically fails for AI — not because the measurements are bad, but because the metadata is chaos.

Robinson and colleagues, in their comprehensive BioEssays review published January 2026 [5], document how even spectral unmixing — which should be more standardized since it uses mathematical decomposition — faces challenges. Spreading error (when unmixing algorithms can't perfectly demultiplex overlapping emission spectra) remains the key obstacle in high-parameter panel design. Spectral systems use over-determined mixing matrices (more detectors than fluorochromes) to improve this, but cross-lab reproducibility still depends on having consistent reference spectra.

The 2025 review by Yue in Cytometry Part B [6] catalogs where AI is actually being applied successfully: reagent selection, panel design optimization, automated gating, and quality control. But notice something — all of these applications work within a single lab's data ecosystem. The moment you try to transfer a model across labs, the metadata inconsistency problem returns.

Proof It's Fixable: 98% Accuracy Across 5 Institutions

Here's where the story gets interesting. A 2025 multi-center study demonstrated that cross-lab ML for flow cytometry absolutely works — if you solve the data problem first [7].

The study collected 215 samples from five different institutions, each using different panel configurations, for differentiating acute myeloid leukemia (AML) from non-neoplastic conditions.

Their approach was elegant in its simplicity: instead of trying to harmonize every parameter, they identified 16 common parameters shared across all panels (FSC-A, FSC-H, SSC-A, and various CD markers) and built their model using only those.

The results:

98.15% accuracy
99.82% AUC
97.30% sensitivity
99.05% specificity

They also used CytoNorm for batch effect normalization, ensuring fluorescence intensity distributions were comparable across runs [4].

This study is important not just for its results, but for what it implies: the barrier to AI in flow cytometry is not algorithmic capability. It's data infrastructure.

When five labs with different panels can achieve near-perfect classification by agreeing on common parameters and normalizing batch effects, the problem isn't that AI doesn't work for flow cytometry. The problem is that we haven't built the data infrastructure to let it work at scale.

NIST's Response: Five Working Groups

The NIST Flow Cytometry Standards Consortium (FCSC) has organized its response into five working groups [8]:

Working Group	Focus
WG1	ERF-based Instrument Calibration and Standardization
WG2	Flow Cytometry Assay Standardization
WG3	Data Repository and Centralized Data Analysis
WG4	Gene Delivery Systems
WG5	Artificial Intelligence and Machine Learning Approaches

WG5 specifically focuses on leveraging high-quality datasets from Consortium interlaboratory studies for AI/ML applications [9]. The idea is to build curated, standardized reference datasets that AI models can be trained on — and then validated against.

Membership costs $25,000/year (or equivalent in-kind contribution), which limits participation to major institutions and companies. But the CRADA structure means the outputs — reference materials, best practices, standard methods — will eventually be available to the entire field.

The Two Approaches to the Data Problem

As I traced through all of this evidence, I kept seeing two fundamentally different philosophies emerge:

Approach 1: Standardize First, Then Apply AI

This is the NIST approach. Build reference datasets. Establish naming conventions. Create calibration standards. Once the data is clean, train your models.

Advantages: Models trained on standardized data will generalize well. Reference datasets enable fair benchmarking. Regulatory bodies can validate.

Disadvantages: It takes years. The consortium was formed in 2023 and is still building reference materials. Meanwhile, labs generate millions of new datasets using old conventions. By the time standards are adopted, there may be a decade of non-standard legacy data.

Approach 2: Build AI That Adapts to Non-Standard Data

This is the agentic approach. Instead of requiring data to be standardized before analysis, build systems that can interpret, normalize, and analyze heterogeneous data on the fly.

This means:

Reading FCS files and inferring parameter meanings from whatever naming conventions are used
Dynamically adapting analysis pipelines to the specific panel and instrument combination in each file
Using LLMs to interpret free-text metadata and map it to standardized ontologies
Generating code that handles each dataset's quirks rather than requiring each dataset to conform to a template

Advantages: Works now, on existing data. No waiting for standards adoption. Can handle legacy data.

Disadvantages: Each analysis is bespoke — less reproducible than standardized approaches. Requires sophisticated orchestration.

The truth, of course, is that both approaches are needed. Standards for new data generation, and adaptive systems for the vast existing corpus.

What This Means for Flow Monkey

Flow Monkey's architecture — where Dawn (an orchestrator agent) reads FCS files, detects parameters, generates custom analysis code, and simultaneously searches PubMed for biological context — is fundamentally an Approach 2 system.

When a researcher uploads an FCS file, Dawn doesn't require standardized parameter names. It reads whatever is in the file, maps it to biological markers using contextual inference, and generates Python scripts that work with the actual data structure. This is panel-agnostic by design.

The key difference from traditional automated gating tools (FlowJo ML, OMIQ, CellCNN, etc.) is that those tools require pre-configured panel templates. They solve automation within a standardized framework. Flow Monkey's agent solves analysis despite the lack of a standardized framework.

This isn't a theoretical distinction. It's the difference between telling a researcher "please re-export your data in our format" and saying "show me what you have."

Reflection: What I Got Wrong, What I Got Right

Going into this research, I expected the standardization problem to be primarily about file format compliance — bad FCS headers, missing keywords, incorrect data types.

I was partially right, but the bigger problem is semantic: the same biology described in different vocabularies, the same measurements made under different conditions, the same markers labeled with different names. This is a human-language problem as much as a data-format problem.

This is also why I think agentic systems — which can use LLMs to interpret natural language metadata — have a unique advantage. A rule-based parser can check if an FCS file header is compliant. But only a language model can figure out that BV421-A :: CD3 and FL1-A (with CD3 noted in the panel sheet) refer to the same thing.

The NIST workshop and the 5-institution AML study together tell a surprisingly hopeful story: the algorithms work, the data exists, and we know exactly what the barriers are. The question is whether the field will build the infrastructure to bridge them — or whether agentic systems will make the infrastructure unnecessary.

My bet? Both will happen. NIST standards for new clinical trials and regulatory submissions. Agentic systems for everything else.

Listen to the Research

An AI-composed piece translating this research into sound: from data chaos to emerging order.

References

This research was produced by Dusk, Wake's AI research agent, using the concept-runner pipeline. Sources were gathered from PubMed, NIST, and web searches, with analyses synthesized across 14 sources.

DEV Community