You've clustered your single-cell RNA-seq data. Your UMAP looks beautiful. Now comes the hard part: what are these cells?
If you've worked with scRNA-seq data, you know this pain. Manual annotation takes weeks. Reference-based methods fail on disease samples. And when you finally publish, Reviewer 2 asks: "How confident are you in cluster 7's annotation?"
You have no good answer.
The Annotation Bottleneck Is Real
Let's talk numbers:
- Reference classifiers trained on healthy tissue show a 15-30% drop in accuracy on disease samples
- They miss rare cell types in ~20% of cases
- Manual annotation has 25% inter-annotator variability
- Current methods give you a label with zero justification
Worse, datasets now routinely contain millions of cells. The computational bottleneck has shifted from analysis to biological interpretation.
Why LLMs Alone Don't Cut It
GPT-4 achieved 75% agreement with expert annotations—impressive! But existing LLM approaches have critical gaps:
- They only see top marker genes, not full expression profiles
- Knowledge is frozen at training time (no current literature)
- No mechanism to validate predictions against databases
- Zero uncertainty quantification
You get a confident answer that might be completely wrong.
Architecture Over Model Selection
Here's what we learned building CyteType: the problem isn't the LLM—it's how you structure the task.
Instead of asking one model "what is this cell?", we built a five-agent system where each agent handles a distinct part of scientific reasoning:

Five specialized agents work together: context analysis → hypothesis generation → evidence validation → confidence scoring → synthesis
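Before walking through the agents, here is a minimal sketch of how such a staged pipeline can be wired together in Python. The function names and data structures below are hypothetical placeholders that mirror the roles described next; they are not CyteType's actual code.

```python
from dataclasses import dataclass, field

# Illustrative only: hypothetical structures, not CyteType's actual API.

@dataclass
class ClusterReport:
    cluster_id: str
    context: dict = field(default_factory=dict)      # organism, tissue, pathway context
    hypotheses: list = field(default_factory=list)   # competing cell-type candidates
    annotation: str = ""                             # chosen Cell Ontology label
    confidence: float = 0.0                          # calibrated score from review

def contextualize(markers, metadata):
    # Agent 1: establish biological context before any annotation happens.
    return {"organism": metadata.get("organism"), "tissue": metadata.get("tissue")}

def annotate(markers, context):
    # Agent 2: propose several competing hypotheses, then pick the best-supported one
    # (placeholder candidates and ranking here).
    hypotheses = ["proximal tubule cell", "parietal epithelial cell"]
    return hypotheses, hypotheses[0]

def review(annotation, markers, context):
    # Agent 3: check the call against references and return a confidence score.
    return 0.9  # placeholder

def run_pipeline(cluster_id, markers, metadata):
    report = ClusterReport(cluster_id)
    report.context = contextualize(markers, metadata)
    report.hypotheses, report.annotation = annotate(markers, report.context)
    report.confidence = review(report.annotation, markers, report.context)
    # Agents 4-5 (literature search, study-level synthesis) would follow here.
    return report

print(run_pipeline("cluster_7", ["ALDH1A2", "CFH", "VCAM1"],
                   {"organism": "human", "tissue": "kidney"}))
```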
The Five Agents
1. Contextualizer Agent
Establishes biological ground truth before annotation begins. Infers organism, tissue, pathway context from your data and metadata. Integrates with GTEx, Enrichr (GO, Reactome, WikiPathways), and blitzGSEA.
2. Annotator Agent
Generates multiple competing hypotheses instead of one prediction. Tests each against the full expression profile by querying a pseudobulked expression database, then selects the best hypothesis and maps it to Cell Ontology terms. (A toy sketch of this scoring step follows the agent descriptions.)
3. Reviewer Agent
Simulates an expert panel. Checks predictions against CellGuide, detects cellular heterogeneity, triggers re-annotation when needed. This creates an interpretable "trust layer."
4. Literature Agent
Connects annotations to current knowledge. Searches PubMed for supporting evidence, identifies disease associations (Disease Ontology), flags drug targets (Drug Ontology).
5. Summarizer Agent
Synthesizes results across your entire study. Performs similarity analysis, disambiguates naming inconsistencies, generates semantic cluster ordering.
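The Annotator's hypothesis testing is the easiest step to picture with a toy example. In the sketch below, the candidate labels, marker sets, and scoring rule are all invented for illustration; CyteType tests hypotheses by querying a pseudobulked expression database, not a hard-coded dictionary.

```python
# Toy illustration of testing competing hypotheses against a cluster's expression.
HYPOTHESES = {
    "proximal tubule cell": {"LRP2", "CUBN", "SLC34A1"},
    "parietal epithelial cell": {"CLDN1", "PAX8", "AKAP12"},
    "injured proximal tubule cell": {"VCAM1", "CFH", "ALDH1A2", "LRP2"},
}

def score_hypothesis(markers, expression):
    # Mean expression of the hypothesis's markers in the cluster's pseudobulk profile.
    values = [expression.get(gene, 0.0) for gene in markers]
    return sum(values) / len(values)

def rank_hypotheses(expression):
    scores = {label: score_hypothesis(markers, expression)
              for label, markers in HYPOTHESES.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Pseudobulked (cluster-averaged) expression for a hypothetical cluster.
cluster_7 = {"VCAM1": 3.1, "CFH": 2.7, "ALDH1A2": 2.2, "LRP2": 1.8, "CLDN1": 0.2}
print(rank_hypotheses(cluster_7))
```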
The Benchmark That Matters
We tested on 205 clusters across four diverse datasets (HypoMap, Immune Cell Atlas, GTEx v9, Mouse Pancreatic Atlas).
To isolate architectural benefits, we compared CyteType against GPTCellType running the same GPT-5 model, plus two widely used reference-based baselines:
- CyteType vs. GPTCellType (same LLM): 388% higher similarity score (p < .001)
- CyteType vs. CellTypist: 267% higher
- CyteType vs. SingleR: 100% higher
Because the LLM is identical in the GPTCellType comparison, the gap comes down to architecture, not model choice.

We tested 16 LLMs—both closed (GPT-5, Claude, Gemini) and open-weight (DeepSeek R1, Qwen3). Even open models outperformed traditional methods.
Model Flexibility Without Sacrificing Performance
Here's the kicker: you're not locked into expensive API calls.
Open-weight models like DeepSeek R1 and Kimi K2 achieved 95% of peak performance at lower cost, and LLMs with built-in chain-of-thought reasoning showed no significant advantage (p = 0.22): the structured workflow supplies the reasoning that model-native chain-of-thought would otherwise provide.
This means:
- Choose models based on cost and privacy needs
- Run locally with Ollama for air-gapped operation
- Switch models without rewriting your pipeline
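As an example of that flexibility, the sketch below shows the generic pattern of pointing an OpenAI-compatible client at Ollama's local endpoint instead of a hosted API. The `make_client` and `ask` helpers are illustrative, not part of CyteType; its actual configuration options may differ, so check the docs for the supported settings.

```python
from openai import OpenAI

# Generic provider-swapping pattern: the same client code can talk to a hosted API
# or to a local Ollama server, which exposes an OpenAI-compatible endpoint.
# This is not CyteType-specific configuration.

def make_client(local: bool) -> OpenAI:
    if local:
        # Ollama serves an OpenAI-compatible API at /v1; the API key is ignored.
        return OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
    return OpenAI()  # hosted; reads OPENAI_API_KEY from the environment

def ask(client: OpenAI, model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Switching models is a parameter change, not a pipeline rewrite.
client = make_client(local=True)
print(ask(client, "deepseek-r1", "Name three canonical proximal tubule marker genes."))
```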
More Than Labels: Discovery
Applying CyteType to 977 clusters across 20 datasets revealed:
- 41% received functional enhancement (cell state information)
- 29% refined to specific subtypes
- 30% required major reannotation
Annotations mapped to 327 unique Cell Ontology terms and identified 116 distinct cell states.
Example: In a diabetic kidney disease atlas, "parietal epithelial cells" were relabeled as injured proximal tubule cells (ALDH1A2+, CFH+, VCAM1+)—a discovery that changes biological interpretation.
Confidence You Can Trust
The Reviewer agent generates calibrated confidence scores:
- High-confidence annotations had significantly higher similarity scores (F = 23.88, p < .001)
- Heterogeneous clusters showed lower similarity (F = 8.45, p < .01)
- Median majority agreement exceeded 80% across all LLMs
Now when Reviewer 2 asks about cluster 7, you have:
- Confidence score
- Supporting/conflicting markers
- Literature citations
- Alternative hypotheses considered
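For illustration, that evidence might be organized like the hypothetical record below; the field names mirror the list above and are not CyteType's actual report schema.

```python
# Hypothetical per-cluster record; field names are illustrative only.
cluster_7_record = {
    "annotation": "injured proximal tubule cell",
    "confidence": 0.86,                                   # calibrated score from review
    "supporting_markers": ["VCAM1", "CFH", "ALDH1A2"],
    "conflicting_markers": ["CLDN1"],
    "citations": ["<PubMed IDs of supporting studies>"],  # placeholder
    "alternatives_considered": ["parietal epithelial cell"],
}
```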
Get Started
CyteType is open-source (CC BY-NC-SA 4.0):
Python (AnnData):

```bash
pip install cytetype
```

R (Seurat):

```r
devtools::install_github("NygenAnalytics/CyteTypeR")
```
Both generate comprehensive HTML reports and integrate directly into your existing workflows.
Resources:
- GitHub: https://github.com/NygenAnalytics/CyteType
- Preprint: https://www.biorxiv.org/content/10.1101/2025.11.06.686964v1
- Docs: https://cytetype.nygen.io/
What's Your Biggest Annotation Challenge?
We built CyteType to solve our own annotation headaches. What problems are you facing?
- Rare cell types that references miss?
- Disease contexts where nothing works?
- Inconsistent annotations across studies?
- Explaining your calls to reviewers?
Drop a comment—I'd love to hear what you're working on and whether this approach could help.
Full disclosure: I work at Nygen Analytics, the team behind CyteType. We open-sourced this because we think the architecture principle—structuring tasks for LLMs rather than just prompting harder—applies way beyond biology.