
Oluwagbade Odimayo


Can AI Reason From Marker Genes? Building a Single-Cell Benchmark From PBMC3k

Most single-cell RNA-seq examples follow this pattern and stop at the last step:

load data
preprocess
cluster cells
generate UMAP
rank marker genes
assign cell labels

That workflow is useful, but it leaves one important part underdeveloped: the reasoning step.

A cluster label is only meaningful if it is supported by marker-gene evidence.

The Single-Cell Marker Reasoning Benchmark turns that reasoning step into a reproducible benchmark.

Repository:

GitHub

What the Project Does

The project starts with PBMC3k single-cell RNA-seq data and runs a Scanpy-based analysis workflow.

It then converts the marker-gene outputs into benchmark tasks.

The result is not just a single-cell analysis. It is an evaluation system for marker-gene reasoning.

Dataset

The project uses PBMC3k through Scanpy.

Raw dataset: 2700 cells × 32738 genes
Processed dataset: 2694 cells × 2000 highly variable genes
Clusters: 9 Leiden clusters

PBMC stands for peripheral blood mononuclear cells. These are immune cells from blood, such as T cells, B cells, NK cells, and monocytes, with well-characterised marker genes, which makes the dataset useful for marker-gene interpretation examples.

Analysis Workflow

The workflow includes:

PBMC3k loading
QC and preprocessing
normalisation
log transformation
highly variable gene selection
PCA
neighbour graph construction
UMAP
Leiden clustering
marker-gene ranking
marker filtering
cluster annotation
benchmark generation
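The steps above map closely onto standard Scanpy calls. The sketch below is one plausible arrangement, not the project's actual script: the function name, the QC thresholds (`min_genes=200`, `min_cells=3`), the default Leiden resolution, and the Wilcoxon ranking method are all assumptions, so it will not necessarily reproduce the 2694-cell, 9-cluster result reported here.

```python
def run_pbmc3k_workflow(n_top_genes=2000, resolution=1.0):
    """Hypothetical end-to-end sketch; thresholds are assumptions, not the project's values."""
    import scanpy as sc  # imported lazily so this module loads without Scanpy installed

    adata = sc.datasets.pbmc3k()                      # raw counts (downloads on first use)

    # QC and preprocessing (assumed thresholds)
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    # Normalisation and log transformation
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)

    # Highly variable gene selection, then subset to those genes
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata = adata[:, adata.var["highly_variable"]].copy()

    # PCA, neighbour graph, UMAP, Leiden clustering
    sc.tl.pca(adata)
    sc.pp.neighbors(adata)
    sc.tl.umap(adata)
    sc.tl.leiden(adata, resolution=resolution)

    # Marker-gene ranking per cluster, returned as a tidy table
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    markers = sc.get.rank_genes_groups_df(adata, group=None)
    return adata, markers
```

Calling `run_pbmc3k_workflow()` returns the processed `AnnData` object and a marker table. Marker filtering, annotation, and benchmark generation would then operate on that table.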

Why Marker Filtering Was Added

Raw marker-gene outputs can contain genes that are not ideal for reasoning tasks.

Examples:

RPS*
RPL*
MT-*
MALAT1
TPT1
EEF1A1
B2M

These genes may reflect ribosomal signal, mitochondrial signal, housekeeping expression, or broad background activity.

The project keeps raw marker outputs but also creates filtered marker tables so benchmark tasks are based on more biologically useful signals.
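A filter of this kind can be very small. The following is a minimal sketch, assuming markers arrive as a ranked list of gene symbols. The names `UNINFORMATIVE_PREFIXES`, `UNINFORMATIVE_GENES`, and `filter_markers` are illustrative and are not taken from the repository.

```python
# Prefixes and gene symbols taken from the examples listed in the article.
UNINFORMATIVE_PREFIXES = ("RPS", "RPL", "MT-")
UNINFORMATIVE_GENES = frozenset({"MALAT1", "TPT1", "EEF1A1", "B2M"})


def is_uninformative(gene):
    """True if the symbol matches a filtered prefix or an explicitly listed gene."""
    symbol = gene.upper()
    return symbol.startswith(UNINFORMATIVE_PREFIXES) or symbol in UNINFORMATIVE_GENES


def filter_markers(ranked_genes):
    """Drop uninformative genes while preserving the original ranking order."""
    return [gene for gene in ranked_genes if not is_uninformative(gene)]


ranked = ["RPS12", "CD3D", "MALAT1", "IL7R", "MT-CO1", "LTB", "RPL32", "CD3E"]
print(filter_markers(ranked))  # ['CD3D', 'IL7R', 'LTB', 'CD3E']
```

Prefix matching is coarse: a kinase such as RPS6KA1 would also be dropped by the `RPS` rule, so a stricter pattern may be preferable in practice.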

Example Cluster Annotations

| Cluster | Annotation | Marker Evidence |
| --- | --- | --- |
| 0 | T cells | CD3D, CD3E, IL7R, LTB |
| 1 | CD14+ monocytes | LYZ, S100A8, S100A9, FCN1 |
| 2 | B cells | CD79A, CD79B, MS4A1, CD74 |
| 4 | NK cells | NKG7, GNLY, GZMB, PRF1 |
| 7 | Platelets | PPBP, PF4, GNG11, SDPR |

These are marker-derived working annotations, not experimentally validated ground truth.

Benchmark Task Families

The project generates three benchmark task families.

1. Hidden Cluster Annotation

A solver receives marker genes and predicts the likely cell type.

Example:

CD79A, CD79B, MS4A1, CD74

Expected interpretation:

B cells
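To make the task concrete, here is a toy baseline solver that picks the label whose reference markers overlap most with the input. It is a sketch only: the reference sets are copied from the example annotation table, and `REFERENCE_MARKERS` and `predict_cell_type` are hypothetical names, not part of the project.

```python
# Reference marker sets copied from the example cluster annotations.
REFERENCE_MARKERS = {
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
    "CD14+ monocytes": {"LYZ", "S100A8", "S100A9", "FCN1"},
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
    "Platelets": {"PPBP", "PF4", "GNG11", "SDPR"},
}


def predict_cell_type(markers):
    """Return (label, supporting_genes) for the best-overlapping reference set."""
    observed = {m.upper() for m in markers}
    overlaps = {label: observed & ref for label, ref in REFERENCE_MARKERS.items()}
    best = max(overlaps, key=lambda label: len(overlaps[label]))
    if not overlaps[best]:
        return None, []  # no reference marker matched at all
    return best, sorted(overlaps[best])


label, evidence = predict_cell_type(["CD79A", "CD79B", "MS4A1", "CD74"])
print(label, evidence)  # B cells ['CD74', 'CD79A', 'CD79B', 'MS4A1']
```

A lookup like this only recognises the five labels it was given, which is why it is better treated as a baseline to compare solvers against than as a solver in its own right.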

2. Marker Contradiction Detection

A solver checks whether marker evidence contradicts a proposed annotation.

Example:

Claim: B cells
Markers: NKG7, GNLY, GZMB, PRF1

The marker evidence supports NK or cytotoxic immune cells, not B cells.
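The same overlap idea gives a simple contradiction check: a claim is flagged when some other label is better supported by the markers. This is an illustrative sketch, with reference sets taken from the example annotation table. `check_claim` and its return fields are assumptions, not the project's API.

```python
# Reference marker sets copied from the example cluster annotations.
REFERENCE_MARKERS = {
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
    "CD14+ monocytes": {"LYZ", "S100A8", "S100A9", "FCN1"},
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
    "Platelets": {"PPBP", "PF4", "GNG11", "SDPR"},
}


def check_claim(claim, markers):
    """Flag a claim as contradicted when another label has more marker support."""
    observed = {m.upper() for m in markers}
    support = {label: len(observed & ref) for label, ref in REFERENCE_MARKERS.items()}
    claimed_support = support.get(claim, 0)  # unknown claims get zero support
    best_label = max(support, key=support.get)
    contradicted = support[best_label] > claimed_support
    return {
        "claim": claim,
        "contradicted": contradicted,
        "claimed_support": claimed_support,
        "better_supported": best_label if contradicted else None,
    }


result = check_claim("B cells", ["NKG7", "GNLY", "GZMB", "PRF1"])
print(result["contradicted"], result["better_supported"])  # True NK cells
```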

3. Masked Marker Recovery

A solver receives partial marker evidence and recovers the likely biological identity.

This tests reasoning under incomplete evidence.
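One way such a task could be built is to hide a fixed number of markers with a seeded random generator, so the same task is regenerated on every run. The function name `mask_markers` and the choice of seeding are assumptions for illustration.

```python
import random


def mask_markers(markers, n_hidden, seed=0):
    """Split a marker list into (visible, hidden) with a reproducible random mask."""
    if not 0 < n_hidden < len(markers):
        raise ValueError("n_hidden must leave at least one visible and one hidden marker")
    rng = random.Random(seed)  # local generator keeps the global random state untouched
    hidden = set(rng.sample(list(markers), n_hidden))
    visible = [m for m in markers if m not in hidden]  # keeps the original ranking order
    return visible, sorted(hidden)


visible, hidden = mask_markers(["NKG7", "GNLY", "GZMB", "PRF1"], n_hidden=2, seed=42)
print("solver sees:", visible)
print("held back:  ", hidden)
```

The solver would receive only the visible markers. The held-back markers belong with the hidden answers.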

Public Tasks and Hidden Answers

The benchmark separates public task inputs from hidden answer keys.

benchmark_tasks/public/
benchmark_tasks/hidden/
benchmark_tasks/oracle_outputs/

Current benchmark size:

16 public tasks
16 hidden answers

Keeping answer keys out of the public task files means a solver cannot read the expected label alongside the task, which makes its scores more credible.
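One way to enforce the split is to strip answer fields from each task record before it is written to the public directory. The sketch below assumes tasks are plain dictionaries. The field names (`task_id`, `task_family`, `expected_label`, `supporting_genes`) and the example task ID are hypothetical and may not match the repository's schema.

```python
import json

# Assumed names of the fields that must never appear in public files.
ANSWER_FIELDS = ("expected_label", "supporting_genes")


def split_task(task):
    """Return (public, hidden) records that share only the task_id."""
    public = {k: v for k, v in task.items() if k not in ANSWER_FIELDS}
    hidden = {"task_id": task["task_id"]}
    hidden.update({k: task[k] for k in ANSWER_FIELDS if k in task})
    return public, hidden


task = {
    "task_id": "annotation_001",
    "task_family": "hidden_cluster_annotation",
    "markers": ["CD79A", "CD79B", "MS4A1", "CD74"],
    "expected_label": "B cells",
    "supporting_genes": ["CD79A", "MS4A1"],
}
public, hidden = split_task(task)
print(json.dumps(public))  # safe to publish
print(json.dumps(hidden))  # kept with the answer keys
```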

Oracle Outputs

Oracle outputs provide reference-style answers.

They include:

predicted label
supporting genes
confidence
rationale

This allows the benchmark to support future model or human solver evaluation.
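The four fields suggest a simple record type. Here is a sketch using a dataclass. The class name `OracleOutput`, the assumption that confidence is a number between 0 and 1, and the example values are illustrative rather than taken from the project's files.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class OracleOutput:
    predicted_label: str
    supporting_genes: list
    confidence: float  # assumed to be a probability-like value in [0, 1]
    rationale: str

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


oracle = OracleOutput(
    predicted_label="B cells",
    supporting_genes=["CD79A", "CD79B", "MS4A1", "CD74"],
    confidence=0.9,  # illustrative value only
    rationale="CD79A and CD79B are B-cell receptor components; MS4A1 encodes CD20.",
)
print(json.dumps(asdict(oracle), indent=2))
```

Because solver answers can use the same shape, one validator can check both oracle outputs and submissions.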

Validators and Scoring

The project includes:

src/scbench/validators.py
src/scbench/scoring.py
scripts/07_score_solver_answers.py

The scoring logic checks whether answers match the expected label, include supporting evidence, and provide reasoning.

Sample scoring result:

accuracy: 1.0
average score: 0.923
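A scoring rule along these lines could combine the three checks into one weighted score. The sketch below is an assumption about how such a scorer might look, not the logic in `src/scbench/scoring.py`: the weights (0.6, 0.3, 0.1), the field names, and the function name `score_answer` are all hypothetical.

```python
def score_answer(answer, expected, weights=(0.6, 0.3, 0.1)):
    """Weighted score from label match, evidence overlap, and presence of a rationale."""
    w_label, w_evidence, w_rationale = weights

    # 1. Label match, ignoring case and surrounding whitespace
    predicted = answer.get("predicted_label", "").strip().lower()
    label_ok = predicted == expected["expected_label"].strip().lower()

    # 2. Fraction of expected supporting genes that the answer cites
    expected_genes = {g.upper() for g in expected["supporting_genes"]}
    cited = {g.upper() for g in answer.get("supporting_genes", [])}
    evidence = len(cited & expected_genes) / len(expected_genes) if expected_genes else 0.0

    # 3. A non-empty rationale earns the remaining weight
    has_rationale = bool(answer.get("rationale", "").strip())

    score = w_label * label_ok + w_evidence * evidence + w_rationale * has_rationale
    return {"correct": label_ok, "score": round(score, 3)}


expected = {"expected_label": "B cells", "supporting_genes": ["CD79A", "CD79B", "MS4A1", "CD74"]}
answer = {
    "predicted_label": "B cells",
    "supporting_genes": ["CD79A", "MS4A1"],
    "rationale": "CD79A and MS4A1 point to B cells.",
}
print(score_answer(answer, expected))  # {'correct': True, 'score': 0.85}
```

Under a scheme like this, a solver can get every label right and still average below 1.0 if it cites only part of the expected evidence.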

Testing and Reproducibility

The project includes:

pytest
Docker
Makefile
GitHub Actions CI
evidence files

Current test status:

36 passed

The Docker workflow validates that the project can run in a clean container environment.

Why This Project Is Different

A typical single-cell project usually produces:

clusters
UMAPs
marker tables
annotations

This project produces:

clusters
UMAPs
marker tables
filtered markers
annotations
benchmark tasks
hidden answers
oracle outputs
validators
scoring reports
calibration assets
Docker validation
CI validation
evidence documentation

Main Takeaway

The project demonstrates how a single-cell RNA-seq workflow can serve as a benchmark system.

Instead of only asking:

What are the clusters?

the benchmark asks:

Can a solver justify the cell-type interpretation from marker-gene evidence?

That shift moves the project from analysis to evaluation.
