<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sukhitha Basnayake</title>
    <description>The latest articles on DEV Community by Sukhitha Basnayake (@sukhitha_b).</description>
    <link>https://dev.to/sukhitha_b</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3717657%2Fe14a43d3-f821-41c6-8727-84767e6fbb22.png</url>
      <title>DEV Community: Sukhitha Basnayake</title>
      <link>https://dev.to/sukhitha_b</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sukhitha_b"/>
    <language>en</language>
    <item>
      <title>Why Cell Type Annotation is Still the Hardest Part of scRNA-seq (And How Multi-Agent AI Fixes It)</title>
      <dc:creator>Sukhitha Basnayake</dc:creator>
      <pubDate>Mon, 26 Jan 2026 13:34:41 +0000</pubDate>
      <link>https://dev.to/sukhitha_b/why-cell-type-annotation-is-still-the-hardest-part-of-scrna-seq-and-how-multi-agent-ai-fixes-it-4bj6</link>
      <guid>https://dev.to/sukhitha_b/why-cell-type-annotation-is-still-the-hardest-part-of-scrna-seq-and-how-multi-agent-ai-fixes-it-4bj6</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faobqr92cbyci3iy9kgif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faobqr92cbyci3iy9kgif.png" alt="Cover image" width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You've clustered your single-cell RNA-seq data. Your UMAP looks beautiful. Now comes the hard part: &lt;strong&gt;what are these cells?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you've worked with scRNA-seq data, you know this pain. Manual annotation takes weeks. Reference-based methods fail on disease samples. And when you finally publish, Reviewer 2 asks: &lt;em&gt;"How confident are you in cluster 7's annotation?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You have no good answer.&lt;/p&gt;

&lt;h2&gt;The Annotation Bottleneck Is Real&lt;/h2&gt;

&lt;p&gt;Let's talk numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reference classifiers trained on healthy tissue show &lt;strong&gt;15-30% accuracy drop&lt;/strong&gt; on disease samples&lt;/li&gt;
&lt;li&gt;They miss rare cell types in ~20% of cases
&lt;/li&gt;
&lt;li&gt;Manual annotation has &lt;strong&gt;25% inter-annotator variability&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Current methods give you a label with zero justification&lt;/li&gt;
&lt;/ul&gt;
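&lt;p&gt;To see where the "label with zero justification" problem comes from, here is a minimal sketch of the marker-overlap scoring that most manual pipelines start from. The marker sets are illustrative toys, not a curated reference: each cluster gets the cell type whose canonical markers best overlap its top differential genes, and near-ties get resolved silently. That silent tie-breaking is exactly where inter-annotator disagreement lives.&lt;/p&gt;

```python
# Naive marker-overlap annotation, the approach most manual pipelines
# start from. Marker sets below are illustrative toys, not a real reference.

REFERENCE_MARKERS = {
    "T cell": {"CD3D", "CD3E", "IL7R", "TRAC"},
    "B cell": {"CD79A", "CD79B", "MS4A1"},
    "NK cell": {"NKG7", "GNLY", "KLRD1"},
}

def annotate_cluster(top_genes):
    """Score each cell type by Jaccard overlap with the cluster's top genes."""
    genes = set(top_genes)
    scores = {}
    for cell_type, markers in REFERENCE_MARKERS.items():
        overlap = len(genes.intersection(markers))
        scores[cell_type] = overlap / len(genes.union(markers))
    best = max(scores, key=scores.get)
    return best, scores

label, scores = annotate_cluster(["CD3D", "IL7R", "NKG7", "TRAC", "CCL5"])
print(label)   # the winning label...
print(scores)  # ...but nothing here tells you how close the runner-up was
```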

&lt;p&gt;Worse, datasets now routinely contain &lt;strong&gt;millions of cells&lt;/strong&gt;. The computational bottleneck has shifted from analysis to &lt;em&gt;biological interpretation&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;Why LLMs Alone Don't Cut It&lt;/h2&gt;

&lt;p&gt;GPT-4 has achieved 75% agreement with expert annotations, which is impressive. But existing LLM approaches have critical gaps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They only see &lt;strong&gt;top marker genes&lt;/strong&gt;, not full expression profiles&lt;/li&gt;
&lt;li&gt;Knowledge is &lt;strong&gt;frozen at training time&lt;/strong&gt; (no current literature)&lt;/li&gt;
&lt;li&gt;No mechanism to &lt;strong&gt;validate predictions&lt;/strong&gt; against databases&lt;/li&gt;
&lt;li&gt;Zero &lt;strong&gt;uncertainty quantification&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You get a confident answer that might be completely wrong.&lt;/p&gt;

&lt;h2&gt;Architecture Over Model Selection&lt;/h2&gt;

&lt;p&gt;Here's what we learned building &lt;a href="https://github.com/NygenAnalytics/CyteType" rel="noopener noreferrer"&gt;CyteType&lt;/a&gt;: &lt;strong&gt;the problem isn't the LLM—it's how you structure the task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of asking one model "what is this cell?", we built a &lt;strong&gt;five-agent system&lt;/strong&gt; where each agent handles a distinct part of scientific reasoning:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faobqr92cbyci3iy9kgif.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faobqr92cbyci3iy9kgif.png" alt="CyteType multi-agent workflow" width="800" height="410"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Five specialized agents work together: context analysis → hypothesis generation → evidence validation → confidence scoring → synthesis&lt;/em&gt;&lt;/p&gt;
&lt;h3&gt;The Five Agents&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Contextualizer Agent&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Establishes biological ground truth &lt;em&gt;before&lt;/em&gt; annotation begins. Infers organism, tissue, pathway context from your data and metadata. Integrates with GTEx, Enrichr (GO, Reactome, WikiPathways), and blitzGSEA.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Annotator Agent&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Generates &lt;strong&gt;multiple competing hypotheses&lt;/strong&gt; instead of one prediction. Tests each against the &lt;em&gt;full expression profile&lt;/em&gt; by querying a pseudobulked expression database. Selects the best hypothesis and maps it to Cell Ontology terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Reviewer Agent&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Simulates an expert panel. Checks predictions against CellGuide, detects cellular heterogeneity, triggers re-annotation when needed. This creates an interpretable "trust layer."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Literature Agent&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Connects annotations to current knowledge. Searches PubMed for supporting evidence, identifies disease associations (Disease Ontology), flags drug targets (Drug Ontology).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Summarizer Agent&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Synthesizes results across your entire study. Performs similarity analysis, disambiguates naming inconsistencies, generates semantic cluster ordering.&lt;/p&gt;
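&lt;p&gt;To make the division of labor concrete, here is a heavily simplified sketch of the five-stage flow in plain Python. This is &lt;em&gt;not&lt;/em&gt; CyteType's actual code: the agent internals (LLM calls, Enrichr and CellGuide queries, PubMed searches) are replaced with stubs, and every name is illustrative.&lt;/p&gt;

```python
# Simplified five-stage pipeline. Each "agent" is a stub standing in for
# an LLM call plus its external tools. Illustrative only.

def contextualize(metadata):
    # Contextualizer: establish biological context before annotating.
    return {"organism": metadata.get("organism", "unknown"),
            "tissue": metadata.get("tissue", "unknown")}

def annotate(markers, context):
    # Annotator: propose several competing hypotheses, not one prediction.
    return [{"label": "T cell", "score": 0.9},
            {"label": "NK cell", "score": 0.4}]

def review(hypotheses):
    # Reviewer: validate the top hypothesis and attach a confidence level.
    best = max(hypotheses, key=lambda h: h["score"])
    confidence = "high" if best["score"] >= 0.8 else "low"
    return dict(best, confidence=confidence)

def literature(annotation):
    # Literature: attach supporting evidence (stubbed placeholder here).
    return dict(annotation, citations=["placeholder citation"])

def summarize(annotations):
    # Summarizer: harmonize and order labels across the whole study.
    return sorted(annotations, key=lambda a: a["label"])

def run_pipeline(clusters, metadata):
    context = contextualize(metadata)
    reviewed = [literature(review(annotate(m, context))) for m in clusters]
    return summarize(reviewed)
```

&lt;p&gt;The point of the structure is that each stage can fail loudly: a low Reviewer confidence or missing literature support is surfaced instead of being hidden inside one opaque completion.&lt;/p&gt;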
&lt;h2&gt;The Benchmark That Matters&lt;/h2&gt;

&lt;p&gt;We tested on &lt;strong&gt;205 clusters&lt;/strong&gt; across four diverse datasets (HypoMap, Immune Cell Atlas, GTEx v9, Mouse Pancreatic Atlas).&lt;/p&gt;

&lt;p&gt;To isolate architectural benefits, we compared CyteType against &lt;strong&gt;GPTCellType using the same GPT-5 model&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CyteType vs. GPTCellType (same LLM):&lt;/strong&gt; 388% higher similarity score (p &amp;lt; .001)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CyteType vs. CellTypist:&lt;/strong&gt; 267% higher
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CyteType vs. SingleR:&lt;/strong&gt; 100% higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the underlying model is held constant, the gap can only come from the system design: &lt;strong&gt;architecture matters more than model choice.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxljqf6n8xdwx2de2b18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwxljqf6n8xdwx2de2b18.png" alt="Performance comparison across 16 LLMs" width="652" height="640"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;We tested 16 LLMs—both closed (GPT-5, Claude, Gemini) and open-weight (DeepSeek R1, Qwen3). Even open models outperformed traditional methods.&lt;/em&gt;&lt;/p&gt;
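&lt;p&gt;The benchmark's actual similarity score is defined in the preprint; purely to illustrate the kind of comparison involved, here is a crude token-overlap stand-in that gives partial credit when a predicted label ("cytotoxic T cell") is a refinement of the reference label ("T cell").&lt;/p&gt;

```python
# Crude stand-in for an annotation similarity score: token overlap between
# predicted and reference labels. The benchmark's real metric is defined in
# the CyteType preprint; this only illustrates the idea of partial credit.

def label_similarity(predicted, reference):
    p = set(predicted.lower().split())
    r = set(reference.lower().split())
    if not p or not r:
        return 0.0
    return len(p.intersection(r)) / len(p.union(r))

print(label_similarity("cytotoxic t cell", "t cell"))   # partial credit
print(label_similarity("t cell", "t cell"))             # exact match
```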
&lt;h2&gt;Model Flexibility Without Sacrificing Performance&lt;/h2&gt;

&lt;p&gt;Here's the kicker: &lt;strong&gt;you're not locked into expensive API calls.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open-weight models like DeepSeek R1 and Kimi K2 achieve &lt;strong&gt;95% of peak performance&lt;/strong&gt; at lower cost. LLMs with built-in chain-of-thought reasoning showed &lt;em&gt;no significant advantage&lt;/em&gt; (p = 0.22): the structured workflow already supplies the step-by-step reasoning that model-native chain-of-thought would otherwise provide.&lt;/p&gt;

&lt;p&gt;This means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose models based on &lt;strong&gt;cost and privacy needs&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Run locally with Ollama for &lt;strong&gt;air-gapped operation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Switch models without rewriting your pipeline&lt;/li&gt;
&lt;/ul&gt;
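&lt;p&gt;The mechanics behind that flexibility are worth a sketch. Most LLM stacks speak the OpenAI-compatible API, so switching providers means changing only a base URL and a model name; Ollama, for example, serves an OpenAI-compatible endpoint locally. The entries below illustrate the pattern only, so check the CyteType docs for its actual configuration options.&lt;/p&gt;

```python
# Provider switching via OpenAI-compatible endpoints. These entries
# illustrate the pattern; they are not CyteType's actual configuration.

PROVIDERS = {
    "openai": {"base_url": "https://api.openai.com/v1", "model": "gpt-5"},
    # Ollama exposes an OpenAI-compatible API on localhost, so fully
    # air-gapped runs need nothing beyond a locally pulled model.
    "ollama": {"base_url": "http://localhost:11434/v1", "model": "deepseek-r1"},
}

def client_config(provider, api_key="not-needed-for-local"):
    """Return the settings an OpenAI-compatible client would be built from."""
    cfg = PROVIDERS[provider]
    return {"base_url": cfg["base_url"], "model": cfg["model"], "api_key": api_key}

print(client_config("ollama")["base_url"])
```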
&lt;h2&gt;More Than Labels: Discovery&lt;/h2&gt;

&lt;p&gt;Applying CyteType to &lt;strong&gt;977 clusters across 20 datasets&lt;/strong&gt; revealed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;41%&lt;/strong&gt; received functional enhancement (cell state information)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;29%&lt;/strong&gt; refined to specific subtypes
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;30%&lt;/strong&gt; required major reannotation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Annotations mapped to &lt;strong&gt;327 unique Cell Ontology terms&lt;/strong&gt; and identified &lt;strong&gt;116 distinct cell states&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example: In a diabetic kidney disease atlas, "parietal epithelial cells" were relabeled as &lt;strong&gt;injured proximal tubule cells&lt;/strong&gt; (ALDH1A2+, CFH+, VCAM1+)—a discovery that changes biological interpretation.&lt;/p&gt;
&lt;h2&gt;Confidence You Can Trust&lt;/h2&gt;

&lt;p&gt;The Reviewer agent generates calibrated confidence scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-confidence annotations had significantly higher similarity scores (F = 23.88, p &amp;lt; .001)&lt;/li&gt;
&lt;li&gt;Heterogeneous clusters showed lower similarity (F = 8.45, p &amp;lt; .01)
&lt;/li&gt;
&lt;li&gt;Median majority agreement exceeded &lt;strong&gt;80% across all LLMs&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
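&lt;p&gt;Majority agreement, the stability measure in the last bullet, is easy to compute yourself from repeated runs of the same cluster. A minimal sketch (the run labels are made up for illustration):&lt;/p&gt;

```python
from collections import Counter

# Majority agreement across repeated annotation runs of one cluster:
# the fraction of runs producing the most common label. The run labels
# below are illustrative, not real benchmark outputs.

def majority_agreement(labels):
    (top_label, count), = Counter(labels).most_common(1)
    return top_label, count / len(labels)

runs = ["T cell", "T cell", "T cell", "NK cell", "T cell"]
label, agreement = majority_agreement(runs)
print(label, agreement)   # agreement is 4/5 here
```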

&lt;p&gt;Now when Reviewer 2 asks about cluster 7, you have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confidence score&lt;/li&gt;
&lt;li&gt;Supporting/conflicting markers
&lt;/li&gt;
&lt;li&gt;Literature citations&lt;/li&gt;
&lt;li&gt;Alternative hypotheses considered&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Get Started&lt;/h2&gt;

&lt;p&gt;CyteType is &lt;strong&gt;source-available&lt;/strong&gt; and free for non-commercial use (CC BY-NC-SA 4.0):&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python (AnnData):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;cytetype
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;R (Seurat):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight r"&gt;&lt;code&gt;&lt;span class="n"&gt;devtools&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="n"&gt;install_github&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"NygenAnalytics/CyteTypeR"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both generate comprehensive HTML reports and integrate directly into your existing workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: &lt;a href="https://github.com/NygenAnalytics/CyteType" rel="noopener noreferrer"&gt;https://github.com/NygenAnalytics/CyteType&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Preprint: &lt;a href="https://www.biorxiv.org/content/10.1101/2025.11.06.686964v1" rel="noopener noreferrer"&gt;https://www.biorxiv.org/content/10.1101/2025.11.06.686964v1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Docs: &lt;a href="https://cytetype.nygen.io/" rel="noopener noreferrer"&gt;https://cytetype.nygen.io/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's Your Biggest Annotation Challenge?&lt;/h2&gt;

&lt;p&gt;We built CyteType to solve our own annotation headaches. What problems are you facing?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rare cell types that references miss?&lt;/li&gt;
&lt;li&gt;Disease contexts where nothing works?
&lt;/li&gt;
&lt;li&gt;Inconsistent annotations across studies?&lt;/li&gt;
&lt;li&gt;Explaining your calls to reviewers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop a comment—I'd love to hear what you're working on and whether this approach could help.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Full disclosure: I work at Nygen Analytics, the team behind CyteType. We open-sourced this because we think the architecture principle—structuring tasks for LLMs rather than just prompting harder—applies way beyond biology.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
