Oluwagbade Odimayo

Posted on Jun 15

Your UMAP Looks Great. But Can You Prove the Annotation Is Correct?

#singlecellgenomics #evaluationdesign #bioinformatics #scientificsoftware

How I built a production scanpy pipeline that does not just annotate single-cell data -- it measures how accurately it did so, where it fails, and why.

Every single-cell tutorial ends the same way.

A UMAP appears on screen. Clusters are coloured. Labels are assigned by eye. The author says something like "and here we can see CD4+ T cells in cluster 0, B cells in cluster 1..." and that is it. The tutorial ends. The reader closes the tab feeling like they have learned single-cell analysis.

They have learned how to produce a picture.

What they have not learned is how to answer the question that actually matters in a research or clinical context: how do you know those labels are correct, how accurately did the pipeline recover the true cell identities, and where does it fail?

This post is about building a system that answers that question. Not with a visual inspection of marker genes. With a benchmark.

What this project builds

The sc-celltype-benchmark pipeline annotates 2,699 PBMC cells across 8 immune cell types using two independent strategies, then evaluates both strategies against oracle ground-truth labels through a structured benchmark framework with three components that most single-cell pipelines completely lack.

An oracle. A pre-trained CellTypist reference model provides ground-truth labels for all cells. These labels are withheld from the pipeline during annotation. Only after predictions are generated does the evaluator receive the answers.

A stratified holdout. 540 cells (20%) are withheld before annotation begins, with every cell type represented proportionally in both the training and test sets. The pipeline runs blind.

Difficulty tiers. Cell types are assigned to easy, medium, or hard tiers based on abundance, marker distinctiveness, and transcriptional similarity to neighbours. Performance is reported separately per tier, not averaged into a single misleading number.

The result on the PBMC 3k dataset: weighted F1 of 0.915 and Cohen kappa of 0.895 on 540 withheld cells. Three biological validators all pass at 1.0.

The annotation strategies

Strategy A: manual marker scoring

This is the algorithmic equivalent of what a computational biologist does when they read a dot plot. For each Leiden cluster, the pipeline computes the mean log-normalised expression of every canonical cell type's marker gene set across all cells in that cluster. The cell type whose markers have the highest mean expression score wins.

The canonical markers are curated from Zheng et al. 2017 (Nature Communications) and the Human Cell Atlas PBMC reference. Ten cell types are covered.

# For each cluster, score every canonical cell type
for cell_type, type_markers in markers.items():
    available = [g for g in type_markers if g in gene_index]
    if not available:
        continue
    idx = [gene_index[g] for g in available]
    score = float(cluster_X[:, idx].mean())
    if score > best_score:
        best_score = score
        best_type = cell_type

Simple. Interpretable. And it gets the major lineages exactly right: B cells F1=1.000, CD4+ T cells F1=1.000, NK cells F1=0.983, CD14+ Monocytes F1=0.966.

Strategy B: CellTypist reference annotation

The pipeline downloads the Immune_All_Low.pkl model (Dominguez Conde et al. 2022, Science) and runs per-cell prediction with Leiden-based majority voting. CellTypist's fine-grained labels ("Tcm/Naive helper T cells", "CD16+ NK cells", "Megakaryocytes/platelets") are harmonised to the same coarse categories used by manual annotation through an exhaustive 53-label mapping table.

CellTypist is more accurate, particularly for transcriptionally similar subtypes. That is exactly what makes it the right oracle: you use the better method as ground truth to measure the simpler method.

The benchmark framework

Why difficulty tiers matter

A pipeline that scores an overall weighted F1 of 0.92 sounds impressive. But what if 90% of the test cells belong to easy, well-separated lineages like CD4+ T cells and B cells, and the remaining 10% are FCGR3A+ Monocytes and Dendritic cells that the pipeline gets completely wrong?

The 0.92 number hides that failure. Difficulty tiers reveal it.

Tier	Cell types	F1	n cells
Easy	CD4+ T, B, CD14+ Mono, NK, CD8+ T	0.991	499
Medium	FCGR3A+ Mono, Dendritic cells	0.000	38
Hard	Platelets	1.000	3

The easy tier at F1=0.991 tells you the major lineage annotation is essentially perfect. The medium tier at F1=0.000 tells you exactly where the method breaks down and why.

Biological constraint validators

The oracle comparison tells you whether the predictions match the reference. Validators tell you whether the predictions make biological sense, without needing the oracle at all.

Three validators run on every pipeline output:

Cluster purity. Each Leiden cluster should be dominated (>80%) by a single cell type. A mixed cluster indicates over-clustering, biological heterogeneity, or annotation error.

Marker recovery. Each annotated cell type should express at least one of its canonical marker genes above the dataset-wide mean. A type that does not express its markers has been misannotated.

Biological consistency. Four cross-lineage constraints: CD3D must be higher in T cells than B cells; CD79A must be higher in B cells than T cells; CD14 must be higher in CD14+ Monocytes than B cells; NKG7 must be higher in NK cells than CD4+ T cells. These are necessary conditions for any correctly annotated PBMC dataset.

All three pass at 1.000. If any failed, it would indicate a bug, not a performance limitation.

The medium tier failure is not a bug

FCGR3A+ Monocytes and Dendritic cells score F1=0 under manual annotation. 38 test cells, all misclassified.

This is documented honestly because it is a genuine scientific limitation, not something to hide.

FCGR3A+ Monocytes share the core monocyte transcriptional programme -- LYZ, S100A8, S100A9 -- with CD14+ Monocytes. Their distinguishing markers (FCGR3A, MS4A7, IFITM3) are expressed at lower absolute levels. When the mean-expression scorer looks at a cluster containing FCGR3A+ Monocytes, the shared high-abundance monocyte markers dominate over the subtype-specific ones. The cluster gets assigned to CD14+ Monocytes.

The same logic applies to Dendritic cells, whose HLA-DR expression is shared with monocytes.

This is a known limitation of unsupervised marker-scoring annotation, described in the single-cell literature. Reference-based methods like CellTypist handle it correctly because they were trained on labelled examples of each subtype.

The benchmark found this. That is the point. A system that cannot find its own failures is not a benchmark. It is a UMAP with extra steps.

Eight engineering failures and how they were fixed

Building this pipeline required diagnosing and solving eight real failures. Here are the ones that taught the most.

The AnnData snapshot trap

Early in preprocessing, the pipeline saves raw counts for downstream use: adata.raw = adata. When checked afterwards, the raw counts had drifted by thousands.

The root cause is subtle. adata.raw = adata creates a Raw object from the AnnData at assignment time, but the underlying matrix is passed by reference. When sc.pp.normalize_total() modifies adata.X in-place, the .raw.X reference reflects those mutations.

The fix is one word: adata.raw = adata.copy(). An explicit copy before any in-place operations guarantees the snapshot is clean.

This matters because three downstream consumers depend on having the correct matrix: Wilcoxon DE testing, manual annotation scoring, and CellTypist. Get the matrix wrong, and all three produce silently incorrect results.

The CellTypist matrix validation failure

CellTypist raised ValueError: Invalid expression matrix in .X, expect log1p normalized expression to 10000 counts per cell. This occurred even after building the query AnnData from adata.raw.X.

The root cause took careful diagnosis. adata.raw.X holds the copy taken before normalize_total -- raw integer counts summing to roughly 2,500 per cell. CellTypist validates that expm1(X).sum(axis=1) is approximately 10,000. Raw integers fail this check.

After sc.pp.scale(), adata.X holds zero-mean unit-variance values, which also fail. At that point in the pipeline, there was no saved copy of the log1p-normalised intermediate.

The fix: save adata.layers['log_norm'] = adata.X.copy() immediately after sc.pp.log1p() and before sc.pp.scale(). This layer is the only place in the pipeline where the matrix is in the exact state CellTypist requires.

Verification that the fix works:

log_norm_X = adata.layers["log_norm"]
counts_per_cell = np.expm1(log_norm_X).sum(axis=1)
# Result: min=10000, max=10000, mean=10000

AnnData has three different places where expression data can live: .X, .raw.X, and .layers[...]. Each holds data at a different stage of transformation. Knowing which stage each downstream tool requires is the core skill.

The harmonisation map that caused kappa=0.27

The initial benchmark reported Cohen kappa of 0.27. Given the visual quality of the UMAP annotations, this seemed wrong. Investigation revealed the cause.

CellTypist returned 8 labels on PBMC 3k. Five of them were absent from the harmonisation map: Tcm/Naive helper T cells (240 test cells), CD16+ NK cells (88 cells), Tem/Trm cytotoxic T cells (3 cells), DC (7 cells), and Megakaryocytes/platelets (3 cells).

These 341 cells passed through harmonisation with their fine-grained CellTypist labels intact. The oracle held Tcm/Naive helper T cells. Manual annotation returned CD4+ T cells. They are the same biological population, but the string comparison fails. No cell in those 341 could ever match the oracle, regardless of annotation quality.

The fix: build an exhaustive mapping covering all 53 labels in Immune_All_Low.pkl. Add a regression test asserting every label is present. Kappa jumped to 0.895.

# Regression test -- this test exists because of a real production failure
all_immune_all_low_labels = [
    "CD4+ T cells", "CD4+ TCM", "CD4+ TEM", "CD4+ Treg",
    "Tcm/Naive helper T cells",   # was missing -- caused kappa=0.27
    "CD16+ NK cells",             # was missing
    "Megakaryocytes/platelets",   # was missing
    "DC",                         # was missing
    "Tem/Trm cytotoxic T cells",  # was missing
    # ... all 53 labels
]
missing = [label for label in all_immune_all_low_labels
           if label not in CELLTYPIST_COARSE_MAP]
assert missing == []

The architecture decision that unlocks everything

The single most important design decision in the pipeline is the layers['log_norm'] pattern.

After normalisation and log transformation, the data exists in three states as it flows through preprocessing:

Raw integer counts (before normalize_total) -- stored in .raw
Log1p-normalised counts (after log1p, before scale) -- stored in .layers['log_norm']
Scaled zero-mean unit-variance values (after scale) -- stored in .X

Most tutorials collapse these into two states or one. The result is that downstream tools fight over which state they need and the programmer patches around incompatibilities.

The three-state pattern eliminates that entirely. Each downstream consumer uses the state it actually requires:

Consumer	State needed	Where it comes from
Wilcoxon DE (`use_raw=True`)	Log1p-normalised	`.raw.X`
Manual annotation scoring	Log1p-normalised	`.layers['log_norm']`
CellTypist	Log1p-normalised, expm1 sums to 10k	`.layers['log_norm']`
Clustering / PCA	Scaled zero-mean	`.X`
Dot plots / visualisation	Log1p-normalised	`.layers['log_norm']`

One design decision prevents five categories of silent failure.

The numbers

Metric	Value
Dataset	PBMC 3k (10x Genomics, 2,700 cells)
Test cells evaluated	540 (20% stratified holdout)
Benchmark accuracy	0.9241
Benchmark weighted F1	0.9151
Cohen kappa	0.8948
Easy tier F1	0.991
Medium tier F1	0.000 (documented limitation)
Hard tier F1	1.000
Cluster purity validator	1.000 PASS
Marker recovery validator	1.000 PASS
Biological consistency validator	1.000 PASS
Pipeline runtime	39.2 seconds
Tests passing	168

What this project is really about

The benchmark numbers are good. But what the project actually demonstrates is something different: the ability to ask a harder question than the one the tutorial asks.

The tutorial asks: can you annotate single-cell data?

This project asks: how do you know your annotation is correct, where does it fail, and can the evaluation framework find its own limitations?

The oracle holdout, difficulty tiers, biological validators, exhaustive harmonisation map, and honest documentation of the medium tier failure are all answers to that second question. That is the question that matters in a clinical bioinformatics context, in a scientific software evaluation role, and anywhere the annotation affects a downstream decision.

A UMAP that looks right is not evidence. A benchmark that measures correctly, including measuring where it fails, is.

Repository

github.com/gbadedata/sc-celltype-benchmark

The README covers the full pipeline architecture, all eight engineering challenges with solutions, benchmark framework design rationale, and instructions for reproducing the results from a fresh clone in under 40 seconds.

Working on single-cell genomics, computational biology pipelines, or scientific software evaluation? Connect on GitHub or LinkedIn.

DEV Community