Most single-cell RNA-seq examples follow this pattern:
load data
preprocess
cluster cells
generate UMAP
rank marker genes
assign cell labels
That workflow is useful, but it leaves one important part underdeveloped: the reasoning step.
A cluster label is only meaningful if it is supported by marker-gene evidence.
The Single-Cell Marker Reasoning Benchmark turns that reasoning step into a reproducible benchmark.
Repository:
What the Project Does
The project starts with PBMC3k single-cell RNA-seq data and runs a Scanpy-based analysis workflow.
It then converts the marker-gene outputs into benchmark tasks.
The result is not just a single-cell analysis. It is an evaluation system for marker-gene reasoning.
Dataset
The project uses PBMC3k through Scanpy.
Raw dataset: 2700 cells × 32738 genes
Processed dataset: 2694 cells × 2000 highly variable genes
Clusters: 9 Leiden clusters
PBMC stands for peripheral blood mononuclear cell. PBMCs are immune cells isolated from blood with well-characterised marker genes, which makes the dataset useful for marker-gene interpretation examples.
Analysis Workflow
The workflow includes:
PBMC3k loading
QC and preprocessing
normalisation
log transformation
highly variable gene selection
PCA
neighbour graph construction
UMAP
Leiden clustering
marker-gene ranking
marker filtering
cluster annotation
benchmark generation
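The analysis steps above, up to marker-gene ranking, can be sketched with standard Scanpy calls. This is a minimal sketch, not the project's script: the QC thresholds, `target_sum`, and the Wilcoxon ranking method are assumptions chosen for illustration, and only the 2000 highly variable genes figure comes from the dataset summary. The function name `run_pbmc3k_workflow` is hypothetical.

```python
def run_pbmc3k_workflow():
    """Run a PBMC3k analysis that mirrors the listed workflow steps."""
    # Imported lazily so this module can be loaded without Scanpy installed.
    import scanpy as sc

    adata = sc.datasets.pbmc3k()  # downloads the raw count matrix on first use

    # QC thresholds are illustrative assumptions, not the project's values.
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    sc.pp.normalize_total(adata, target_sum=1e4)  # normalisation
    sc.pp.log1p(adata)                            # log transformation

    # Keep the 2000 most variable genes, matching the processed dataset size.
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    adata = adata[:, adata.var["highly_variable"]].copy()

    sc.tl.pca(adata)
    sc.pp.neighbors(adata)   # neighbour graph construction
    sc.tl.umap(adata)
    sc.tl.leiden(adata)      # requires the leidenalg package

    # Rank marker genes per Leiden cluster; the test method is an assumption.
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    return adata
```

The number of cells and clusters this produces depends on the thresholds and clustering resolution, so it will not necessarily match the figures reported above.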
Why Marker Filtering Was Added
Raw marker-gene outputs can contain genes that are not ideal for reasoning tasks.
Examples:
RPS*
RPL*
MT-*
MALAT1
TPT1
EEF1A1
B2M
These genes may reflect ribosomal signal, mitochondrial signal, housekeeping expression, or broad background activity.
The project keeps raw marker outputs but also creates filtered marker tables so benchmark tasks are based on more biologically useful signals.
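A filter of this kind could look like the sketch below. The patterns and gene names are taken from the examples above; the function names are hypothetical, and the project's actual filter list may be longer or stricter.

```python
import re

# Prefix patterns and single genes from the examples above.
# Prefix matching is coarse: "^RPS" would also drop kinases such as RPS6KA1.
UNINFORMATIVE_PATTERNS = [re.compile(p) for p in (r"^RPS", r"^RPL", r"^MT-")]
UNINFORMATIVE_GENES = {"MALAT1", "TPT1", "EEF1A1", "B2M"}


def is_uninformative(gene):
    """Return True if the gene matches a blocklisted name or prefix."""
    if gene in UNINFORMATIVE_GENES:
        return True
    return any(p.search(gene) for p in UNINFORMATIVE_PATTERNS)


def filter_markers(genes):
    """Keep the original ranking order, dropping uninformative genes."""
    return [g for g in genes if not is_uninformative(g)]


raw_markers = ["RPS27", "CD79A", "MT-CO1", "MALAT1", "MS4A1", "RPL13", "CD74"]
print(filter_markers(raw_markers))  # ['CD79A', 'MS4A1', 'CD74']
```

Because the function returns a new list, the raw ranking stays available alongside the filtered one.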
Example Cluster Annotations
| Cluster | Annotation | Marker Evidence |
|---|---|---|
| 0 | T cells | CD3D, CD3E, IL7R, LTB |
| 1 | CD14+ monocytes | LYZ, S100A8, S100A9, FCN1 |
| 2 | B cells | CD79A, CD79B, MS4A1, CD74 |
| 4 | NK cells | NKG7, GNLY, GZMB, PRF1 |
| 7 | Platelets | PPBP, PF4, GNG11, SDPR |
These are marker-derived working annotations, not experimentally validated ground truth.
Benchmark Task Families
The project generates three benchmark task families.
1. Hidden Cluster Annotation
A solver receives marker genes and predicts the likely cell type.
Example:
CD79A, CD79B, MS4A1, CD74
Expected interpretation:
B cells
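A very simple baseline solver for this task family scores each candidate label by how many of its reference markers appear in the input. The sketch below assumes the reference sets are the marker lists from the annotation table; `predict_cell_type` is a hypothetical name, and a real solver would likely use richer evidence than set overlap.

```python
# Reference marker sets copied from the example annotation table.
REFERENCE_MARKERS = {
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
    "CD14+ monocytes": {"LYZ", "S100A8", "S100A9", "FCN1"},
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
    "Platelets": {"PPBP", "PF4", "GNG11", "SDPR"},
}


def predict_cell_type(markers):
    """Return the label whose reference set overlaps the markers most."""
    observed = set(markers)
    scores = {
        label: len(observed & reference) / len(reference)
        for label, reference in REFERENCE_MARKERS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


label, score = predict_cell_type(["CD79A", "CD79B", "MS4A1", "CD74"])
print(label, score)  # B cells 1.0
```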
2. Marker Contradiction Detection
A solver checks whether marker evidence contradicts a proposed annotation.
Example:
Claim: B cells
Markers: NKG7, GNLY, GZMB, PRF1
The marker evidence supports NK or cytotoxic immune cells, not B cells.
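One way to express this check in code is to compare the support for the claimed label against the best-supported alternative. This is a sketch under the assumption that small reference marker sets are available; `check_claim` and the returned field names are hypothetical.

```python
# A small reference, using marker lists from the annotation table.
REFERENCE_MARKERS = {
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
}


def check_claim(claimed_label, markers):
    """Flag a claim when another label is strictly better supported."""
    observed = set(markers)
    support = {
        label: sorted(observed & reference)
        for label, reference in REFERENCE_MARKERS.items()
    }
    best = max(support, key=lambda label: len(support[label]))
    claimed_support = support.get(claimed_label, [])
    contradicted = (
        best != claimed_label and len(support[best]) > len(claimed_support)
    )
    return {
        "claim": claimed_label,
        "contradicted": contradicted,
        "better_supported": best if contradicted else None,
        "supporting_genes": support[best] if contradicted else claimed_support,
    }


result = check_claim("B cells", ["NKG7", "GNLY", "GZMB", "PRF1"])
print(result["contradicted"], result["better_supported"])  # True NK cells
```

Requiring strictly more support for the alternative keeps the check from flagging a claim when the evidence is merely ambiguous.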
3. Masked Marker Recovery
A solver receives partial marker evidence and recovers the likely biological identity.
This tests reasoning under incomplete evidence.
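A masked task can be built by hiding part of a marker list and asking which labels remain consistent with what is left. The sketch below assumes random masking with a fixed seed for reproducibility; the function names and the masking strategy are illustrative, not the project's generator.

```python
import random

REFERENCE_MARKERS = {
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
}


def mask_markers(markers, n_hidden, seed=0):
    """Hide n_hidden markers; return (visible, hidden), reproducibly."""
    rng = random.Random(seed)
    hidden = set(rng.sample(list(markers), n_hidden))
    visible = [g for g in markers if g not in hidden]
    return visible, sorted(hidden)


def consistent_labels(visible):
    """Labels whose reference set contains every visible marker."""
    observed = set(visible)
    return [
        label
        for label, reference in REFERENCE_MARKERS.items()
        if observed <= reference
    ]


visible, hidden = mask_markers(["CD79A", "CD79B", "MS4A1", "CD74"], n_hidden=2)
print(len(visible), consistent_labels(visible))  # 2 ['B cells']
```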
Public Tasks and Hidden Answers
The benchmark separates public task inputs from hidden answer keys.
benchmark_tasks/public/
benchmark_tasks/hidden/
benchmark_tasks/oracle_outputs/
Current benchmark size:
16 public tasks
16 hidden answers
Because solvers only see the public inputs, this separation guards against answer leakage and makes the benchmark results more credible.
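The split can be sketched as a function that writes each task twice: once without the answer, once with only the answer. The `public/` and `hidden/` directory names come from the layout above; the one-JSON-file-per-task convention and the field names (`task_id`, `family`, `markers`, `answer`) are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_task(task, root):
    """Write the public input and the hidden answer to separate folders."""
    public = {k: v for k, v in task.items() if k != "answer"}
    hidden = {"task_id": task["task_id"], "answer": task["answer"]}
    for subdir, payload in (("public", public), ("hidden", hidden)):
        out_dir = Path(root) / subdir
        out_dir.mkdir(parents=True, exist_ok=True)
        path = out_dir / f"{task['task_id']}.json"
        path.write_text(json.dumps(payload, indent=2))
    return public, hidden


task = {
    "task_id": "task_001",  # hypothetical identifier
    "family": "hidden_cluster_annotation",
    "markers": ["CD79A", "CD79B", "MS4A1", "CD74"],
    "answer": "B cells",
}

# A temporary directory stands in for the repository root in this demo.
with tempfile.TemporaryDirectory() as tmp:
    public, hidden = write_task(task, Path(tmp) / "benchmark_tasks")
    written = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.json")
    )

print(written)
print("answer" in public)  # False
```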
Oracle Outputs
Oracle outputs provide reference-style answers.
They include:
predicted label
supporting genes
confidence
rationale
This allows the benchmark to support future model or human solver evaluation.
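The four fields above map naturally onto a small record type. The sketch below is one possible shape, not the project's schema: the class name, the `task_id` field, and the assumption that confidence is a number between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class OracleOutput:
    task_id: str            # hypothetical link back to the public task
    predicted_label: str
    supporting_genes: list
    confidence: float       # assumed to be a probability-like value in [0, 1]
    rationale: str

    def problems(self):
        """Return a list of validation issues; empty means well-formed."""
        issues = []
        if not self.predicted_label.strip():
            issues.append("predicted_label is empty")
        if not self.supporting_genes:
            issues.append("supporting_genes is empty")
        if not 0.0 <= self.confidence <= 1.0:
            issues.append("confidence is outside [0, 1]")
        if not self.rationale.strip():
            issues.append("rationale is empty")
        return issues


oracle = OracleOutput(
    task_id="task_001",
    predicted_label="B cells",
    supporting_genes=["CD79A", "CD79B", "MS4A1"],
    confidence=0.9,
    rationale="CD79A/CD79B and MS4A1 are B-cell receptor and lineage genes.",
)
print(oracle.problems())  # []
print(sorted(asdict(oracle)))
```

Using the same record shape for oracle and solver answers would let one validator check both.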
Validators and Scoring
The project includes:
src/scbench/validators.py
src/scbench/scoring.py
scripts/07_score_solver_answers.py
The scoring logic checks whether answers match the expected label, include supporting evidence, and provide reasoning.
Sample scoring result:
accuracy: 1.0
average score: 0.923
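An accuracy of 1.0 alongside an average score below 1.0 is what a partial-credit rubric produces: every label can be correct while some answers lose points on evidence or reasoning. The sketch below illustrates that idea with hypothetical weights and field names; it is not the logic in `src/scbench/scoring.py` and is not meant to reproduce the 0.923 figure.

```python
def score_answer(answer, expected_label, expected_genes,
                 weights=(0.6, 0.25, 0.15)):
    """Score label match, evidence overlap, and presence of a rationale.

    The weights are illustrative assumptions, not the project's values.
    """
    predicted = answer.get("predicted_label", "").strip().lower()
    label_ok = predicted == expected_label.strip().lower()

    cited = set(answer.get("supporting_genes", []))
    expected = set(expected_genes)
    evidence = len(cited & expected) / len(expected) if expected else 0.0

    has_rationale = bool(answer.get("rationale", "").strip())

    w_label, w_evidence, w_rationale = weights
    score = w_label * label_ok + w_evidence * evidence + w_rationale * has_rationale
    return {"correct": label_ok, "score": round(score, 3)}


answer = {
    "predicted_label": "B cells",
    "supporting_genes": ["CD79A", "MS4A1"],  # cites 2 of 4 expected genes
    "rationale": "Both genes are characteristic of the B-cell lineage.",
}
result = score_answer(answer, "B cells", ["CD79A", "CD79B", "MS4A1", "CD74"])
print(result)  # {'correct': True, 'score': 0.875}
```

Here the label is correct, so the answer counts fully toward accuracy, but citing only half of the expected genes lowers its score.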
Testing and Reproducibility
The project includes:
pytest
Docker
Makefile
GitHub Actions CI
evidence files
Current test status:
36 passed
The Docker workflow validates that the project can run in a clean container environment.
Why This Project Is Different
A typical single-cell project usually produces:
clusters
UMAPs
marker tables
annotations
This project produces:
clusters
UMAPs
marker tables
filtered markers
annotations
benchmark tasks
hidden answers
oracle outputs
validators
scoring reports
calibration assets
Docker validation
CI validation
evidence documentation
Main Takeaway
The project demonstrates how a single-cell RNA-seq workflow can serve as a benchmark system.
Instead of only asking:
What are the clusters?
the benchmark asks:
Can a solver justify the cell-type interpretation from marker-gene evidence?
That shift moves the project from analysis to evaluation.