Oluwagbade Odimayo

Posted on Jun 15

Can AI Reason From Marker Genes? Building a Single-Cell Benchmark From PBMC3k

#ai #datascience #python #science

Most single-cell RNA-seq examples follow this pattern:

load data
preprocess
cluster cells
generate UMAP
rank marker genes
assign cell labels

That workflow is useful, but it leaves one important part underdeveloped: the reasoning step.

A cluster label is only meaningful if it is supported by marker-gene evidence.

The Single-Cell Marker Reasoning Benchmark turns that reasoning step into a reproducible benchmark.

Repository:

GitHub

What the Project Does

The project starts with PBMC3k single-cell RNA-seq data and runs a Scanpy-based analysis workflow.

It then converts the marker-gene outputs into benchmark tasks.

The result is not just a single-cell analysis. It is an evaluation system for marker-gene reasoning.

Dataset

The project uses PBMC3k through Scanpy.

Raw dataset: 2700 cells × 32738 genes
Processed dataset: 2694 cells × 2000 highly variable genes
Clusters: 9 Leiden clusters

PBMC stands for peripheral blood mononuclear cells. These are immune cells isolated from blood, and their major populations have well-characterised marker genes, which makes the dataset useful for marker-gene interpretation examples.

Analysis Workflow

The workflow includes:

PBMC3k loading
QC and preprocessing
normalisation
log transformation
highly variable gene selection
PCA
neighbour graph construction
UMAP
Leiden clustering
marker-gene ranking
marker filtering
cluster annotation
benchmark generation
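The analysis steps listed above, up to marker-gene ranking, can be sketched with standard Scanpy calls. This is a minimal sketch, not the project's actual script: the function name, the QC cutoffs (`min_genes=200`, `min_cells=3`), the normalisation target, the Leiden resolution, and the Wilcoxon ranking method are all assumptions, so it will not necessarily reproduce the 2694 cells or 9 clusters reported here.

```python
def run_pbmc3k_workflow(n_top_genes=2000, resolution=1.0, seed=0):
    """Sketch of the listed steps; every threshold is an assumed default."""
    # Imported lazily so this module can be loaded without Scanpy installed.
    import scanpy as sc

    adata = sc.datasets.pbmc3k()                  # PBMC3k loading (downloads on first use)

    # QC and preprocessing (assumed cutoffs, not taken from the project)
    sc.pp.filter_cells(adata, min_genes=200)
    sc.pp.filter_genes(adata, min_cells=3)

    sc.pp.normalize_total(adata, target_sum=1e4)  # normalisation
    sc.pp.log1p(adata)                            # log transformation

    # Highly variable gene selection, then subset to those genes
    sc.pp.highly_variable_genes(adata, n_top_genes=n_top_genes)
    adata = adata[:, adata.var["highly_variable"]].copy()

    sc.tl.pca(adata, random_state=seed)           # PCA
    sc.pp.neighbors(adata, random_state=seed)     # neighbour graph construction
    sc.tl.umap(adata, random_state=seed)          # UMAP
    sc.tl.leiden(adata, resolution=resolution, random_state=seed)  # Leiden clustering

    # Marker-gene ranking per Leiden cluster (method is an assumption)
    sc.tl.rank_genes_groups(adata, groupby="leiden", method="wilcoxon")
    return adata
```

Calling `run_pbmc3k_workflow()` requires Scanpy plus a Leiden backend and network access for the first download. Fixing the random seed is a design choice that keeps cluster assignments stable between runs, which matters when benchmark tasks are derived from them.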

Why Marker Filtering Was Added

Raw marker-gene outputs can contain genes that are not ideal for reasoning tasks.

Examples:

RPS*
RPL*
MT-*
MALAT1
TPT1
EEF1A1
B2M

These genes may reflect ribosomal signal, mitochondrial signal, housekeeping expression, or broad background activity.

The project keeps raw marker outputs but also creates filtered marker tables so benchmark tasks are based on more biologically useful signals.
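A filter of this kind can be sketched in a few lines. The prefixes and gene names come from the examples above; the function names and the `top_n` parameter are hypothetical and only illustrate the idea.

```python
# Patterns taken from the examples listed in this section.
UNINFORMATIVE_PREFIXES = ("RPS", "RPL", "MT-")
UNINFORMATIVE_GENES = {"MALAT1", "TPT1", "EEF1A1", "B2M"}


def is_informative(gene: str) -> bool:
    """Return False for ribosomal, mitochondrial, or broad housekeeping genes."""
    name = gene.upper()
    return not (name.startswith(UNINFORMATIVE_PREFIXES) or name in UNINFORMATIVE_GENES)


def filter_markers(ranked_genes, top_n=4):
    """Keep the original rank order, drop uninformative genes, return the first top_n."""
    kept = [gene for gene in ranked_genes if is_informative(gene)]
    return kept[:top_n]


# Illustrative ranked list for a T-cell-like cluster (not real project output).
raw = ["RPS27", "MALAT1", "CD3D", "RPL32", "CD3E", "MT-CO1", "IL7R", "B2M", "LTB"]
filtered = filter_markers(raw)
print(filtered)  # ['CD3D', 'CD3E', 'IL7R', 'LTB']
```

A plain prefix match is crude: it would also drop genes such as RPS6KA1, which is a kinase rather than a ribosomal protein, so a stricter pattern may be worth considering.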

Example Cluster Annotations

Cluster	Annotation	Marker Evidence
0	T cells	CD3D, CD3E, IL7R, LTB
1	CD14+ monocytes	LYZ, S100A8, S100A9, FCN1
2	B cells	CD79A, CD79B, MS4A1, CD74
4	NK cells	NKG7, GNLY, GZMB, PRF1
7	Platelets	PPBP, PF4, GNG11, SDPR

These are marker-derived working annotations, not experimentally validated ground truth.

Benchmark Task Families

The project generates three benchmark task families.

1. Hidden Cluster Annotation

A solver receives marker genes and predicts the likely cell type.

Example:

CD79A, CD79B, MS4A1, CD74

Expected interpretation:

B cells
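A simple baseline solver for this task family could score the overlap between the given markers and a reference table. The sketch below copies the reference sets from the example annotation table above; the function name and the overlap rule are assumptions, not the project's solver.

```python
# Reference marker sets copied from the example annotation table.
REFERENCE_MARKERS = {
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
    "CD14+ monocytes": {"LYZ", "S100A8", "S100A9", "FCN1"},
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
    "Platelets": {"PPBP", "PF4", "GNG11", "SDPR"},
}


def predict_cell_type(markers):
    """Return (label, supporting_genes) for the best-overlapping reference set.

    Returns (None, []) when no reference set shares a gene with the input.
    """
    observed = set(markers)
    best_label, best_support = None, set()
    for label, reference in REFERENCE_MARKERS.items():
        support = observed & reference
        if len(support) > len(best_support):
            best_label, best_support = label, support
    return best_label, sorted(best_support)


label, support = predict_cell_type(["CD79A", "CD79B", "MS4A1", "CD74"])
print(label, support)  # B cells ['CD74', 'CD79A', 'CD79B', 'MS4A1']
```

Returning the supporting genes alongside the label keeps the prediction tied to its evidence, which is the point of the task.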

2. Marker Contradiction Detection

A solver checks whether marker evidence contradicts a proposed annotation.

Example:

Claim: B cells
Markers: NKG7, GNLY, GZMB, PRF1

The marker evidence supports NK or cytotoxic immune cells, not B cells.
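One way to sketch this check is to flag a claim when the markers share nothing with the claimed cell type but do overlap another reference set. The reference sets below come from the example annotations in this post; the function name, the returned fields, and the zero-overlap rule are assumptions.

```python
# Subset of the example annotations, used as a reference table.
REFERENCE_MARKERS = {
    "B cells": {"CD79A", "CD79B", "MS4A1", "CD74"},
    "NK cells": {"NKG7", "GNLY", "GZMB", "PRF1"},
    "T cells": {"CD3D", "CD3E", "IL7R", "LTB"},
}


def check_claim(claimed_label, markers):
    """Flag a claim as contradicted when the markers support another label only."""
    observed = set(markers)
    overlaps = {label: len(observed & ref) for label, ref in REFERENCE_MARKERS.items()}
    claimed_overlap = overlaps.get(claimed_label, 0)
    best_label = max(overlaps, key=overlaps.get)
    # Assumed rule: no support for the claim, but some support for another label.
    contradicted = claimed_overlap == 0 and overlaps[best_label] > 0
    return {
        "claim": claimed_label,
        "contradicted": contradicted,
        "better_supported": best_label if contradicted else None,
    }


result = check_claim("B cells", ["NKG7", "GNLY", "GZMB", "PRF1"])
print(result)
```

For the example in this section, the claim is flagged as contradicted and `better_supported` is `"NK cells"`. A real validator would likely need a softer rule for mixed or partially overlapping evidence.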

3. Masked Marker Recovery

A solver receives partial marker evidence and recovers the likely biological identity.

This tests reasoning under incomplete evidence.
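A masked task could be generated by hiding a random subset of a cluster's markers and storing the hidden genes with the answer key. The sketch below is an assumption about how such a task might look: the function name, the field names, and the `[MASKED]` placeholder are all hypothetical.

```python
import random


def make_masked_task(task_id, label, markers, n_masked=2, seed=0):
    """Split one annotated marker list into a public task and a hidden answer."""
    rng = random.Random(seed)  # seeded so the same task is generated every run
    masked_idx = set(rng.sample(range(len(markers)), n_masked))
    visible = [
        "[MASKED]" if i in masked_idx else gene
        for i, gene in enumerate(markers)
    ]
    public = {"task_id": task_id, "markers": visible}
    hidden = {
        "task_id": task_id,
        "label": label,
        "masked_genes": [markers[i] for i in sorted(masked_idx)],
    }
    return public, hidden


public, hidden = make_masked_task(
    "masked_001", "NK cells", ["NKG7", "GNLY", "GZMB", "PRF1"]
)
print(public)
print(hidden)
```

The public record never contains the label or the masked genes, so a solver has to infer the identity from the remaining markers alone.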

Public Tasks and Hidden Answers

The benchmark separates public task inputs from hidden answer keys.

benchmark_tasks/public/
benchmark_tasks/hidden/
benchmark_tasks/oracle_outputs/

Current benchmark size:

16 public tasks
16 hidden answers

This separation keeps answer keys out of the inputs a solver sees, which guards against answer leakage and makes the resulting scores more credible.
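The split can be sketched as a function that writes one task to both directories while keeping answer fields out of the public file. The `public` and `hidden` directory names come from the layout above; the field names, file naming, and `HIDDEN_FIELDS` set are hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical answer-key fields that must never appear in public files.
HIDDEN_FIELDS = {"label", "supporting_genes"}


def write_task(task, root):
    """Write the public and hidden halves of one task under root/benchmark_tasks/."""
    root = Path(root)
    public = {k: v for k, v in task.items() if k not in HIDDEN_FIELDS}
    hidden = {k: v for k, v in task.items() if k in HIDDEN_FIELDS or k == "task_id"}
    for subdir, payload in (("public", public), ("hidden", hidden)):
        out_dir = root / "benchmark_tasks" / subdir
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / f"{task['task_id']}.json").write_text(json.dumps(payload, indent=2))
    return public, hidden


task = {
    "task_id": "annotation_001",
    "task_type": "hidden_cluster_annotation",
    "markers": ["CD79A", "CD79B", "MS4A1", "CD74"],
    "label": "B cells",
    "supporting_genes": ["CD79A", "MS4A1"],
}

with tempfile.TemporaryDirectory() as tmp:
    public, hidden = write_task(task, tmp)
    written = sorted(p.name for p in Path(tmp, "benchmark_tasks").iterdir())

# Leakage check: no answer field may survive in the public record.
leaked = HIDDEN_FIELDS & set(public)
print(written, leaked)
```

A check like `leaked` is cheap to run in CI, so a schema change that accidentally exposes an answer field fails fast.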

Oracle Outputs

Oracle outputs provide reference-style answers.

They include:

predicted label
supporting genes
confidence
rationale

This allows the benchmark to support future model or human solver evaluation.
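The four fields listed above map naturally onto a small record type. The sketch below is one possible shape, not the project's schema: the class name, the `task_id` field, and the assumption that confidence is a number between 0 and 1 are all hypothetical, and the example values are illustrative placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class OracleOutput:
    task_id: str              # hypothetical identifier linking back to a task
    predicted_label: str
    supporting_genes: tuple
    confidence: float         # assumed to be a probability-like value in [0, 1]
    rationale: str

    def __post_init__(self):
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not self.supporting_genes:
            raise ValueError("at least one supporting gene is required")


# Illustrative values only; not taken from the project's oracle files.
oracle = OracleOutput(
    task_id="annotation_001",
    predicted_label="B cells",
    supporting_genes=("CD79A", "CD79B", "MS4A1"),
    confidence=0.9,
    rationale="CD79A, CD79B and MS4A1 are B-cell lineage markers.",
)
print(asdict(oracle))
```

Validating at construction time means a malformed reference answer is rejected before it reaches scoring.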

Validators and Scoring

The project includes:

src/scbench/validators.py
src/scbench/scoring.py
scripts/07_score_solver_answers.py

The scoring logic checks whether answers match the expected label, include supporting evidence, and provide reasoning.

Sample scoring result:

accuracy: 1.0
average score: 0.923
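A scoring rule along these lines can be sketched as a weighted sum of the three checks. This is not the project's formula: the weights, field names, and function name are hypothetical, and the sketch is not meant to reproduce the sample numbers above.

```python
# Hypothetical weights for the three checks; they sum to 1.0.
WEIGHTS = {"label": 0.6, "evidence": 0.25, "reasoning": 0.15}


def score_answer(answer, key):
    """Score one solver answer against its hidden answer key."""
    # Label check: case-insensitive exact match.
    label_ok = answer.get("label", "").strip().lower() == key["label"].strip().lower()

    # Evidence check: fraction of expected supporting genes the solver cited.
    expected = set(key["supporting_genes"])
    cited = set(answer.get("supporting_genes", []))
    evidence = len(cited & expected) / len(expected) if expected else 0.0

    # Reasoning check: only tests that a non-empty rationale was given.
    has_reasoning = bool(answer.get("rationale", "").strip())

    score = (
        WEIGHTS["label"] * label_ok
        + WEIGHTS["evidence"] * evidence
        + WEIGHTS["reasoning"] * has_reasoning
    )
    return {"correct": label_ok, "score": round(score, 3)}


key = {"label": "B cells", "supporting_genes": ["CD79A", "CD79B", "MS4A1", "CD74"]}
answer = {
    "label": "b cells",
    "supporting_genes": ["CD79A", "MS4A1"],
    "rationale": "CD79A and MS4A1 are B-cell markers.",
}
result = score_answer(answer, key)
print(result)  # {'correct': True, 'score': 0.875}
```

Under a scheme of this kind, a solver can get every label right and still average below 1.0 when it cites only part of the expected evidence.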

Testing and Reproducibility

The project includes:

pytest
Docker
Makefile
GitHub Actions CI
evidence files

Current test status:

36 passed

The Docker workflow validates that the project can run in a clean container environment.

Why This Project Is Different

A normal single-cell project usually produces:

clusters
UMAPs
marker tables
annotations

This project produces:

clusters
UMAPs
marker tables
filtered markers
annotations
benchmark tasks
hidden answers
oracle outputs
validators
scoring reports
calibration assets
Docker validation
CI validation
evidence documentation

Main Takeaway

The project demonstrates how a single-cell RNA-seq workflow can serve as a benchmark system.

Instead of only asking:

What are the clusters?

the benchmark asks:

Can a solver justify the cell-type interpretation from marker-gene evidence?

That shift moves the project from analysis to evaluation.
