Emmanuel Chima

Cell-to-Sentence (C2S): LLM-Powered scRNA-seq Annotation with Gemma 4

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

Cell-to-Sentence (C2S) is an AI-powered annotation engine for single-cell RNA sequencing (scRNA-seq) data. It eliminates one of the most expensive bottlenecks in modern genomics: manually labelling what each cluster of cells is and does.
After computationally clustering cells, a trained bioinformatician must inspect marker gene lists, cross-reference databases like CellMarker and PanglaoDB, and formulate a biological interpretation of each cluster's identity and functional state. For a typical dataset this takes 4–8 hours and is highly dependent on domain expertise. C2S reduces this to under 2 minutes.

How it works:
Each cell's transcriptomic profile is converted into a "Cell Sentence": a rank-ordered string of the most highly expressed gene symbols (e.g. CD8A GZMB PRF1 IFNG PDCD1 ...). This natural-language representation is then passed to Gemma 4, which uses its biomedical knowledge and structured chain-of-thought reasoning to return:

  • Cell type: e.g., CD8+ Cytotoxic T Cell

  • Functional state: e.g., Activated / Effector

  • Active pathways: e.g., T cell receptor signaling, Cytokine-mediated signaling

All pathway claims are validated against the Gene Ontology (GO) database to ensure scientific grounding, then the annotations are projected back onto a UMAP for publication-ready visualization.
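
The Cell Sentence step described above can be sketched in a few lines. This is a minimal illustration, not the notebook's actual code: the function name `cell_sentence`, the `top_k` cutoff, and the alphabetical tie-break are all assumptions made for the example.

```python
from typing import Sequence


def cell_sentence(expression: Sequence[float],
                  gene_symbols: Sequence[str],
                  top_k: int = 100) -> str:
    """Rank genes by expression (descending) and join the top_k symbols.

    `top_k` is a hypothetical cutoff chosen for illustration.
    """
    if len(expression) != len(gene_symbols):
        raise ValueError("expression and gene_symbols must have the same length")
    # Keep only genes that are actually expressed in this cell.
    expressed = [(value, symbol)
                 for value, symbol in zip(expression, gene_symbols)
                 if value > 0]
    # Highest expression first; ties are broken alphabetically for determinism.
    expressed.sort(key=lambda pair: (-pair[0], pair[1]))
    return " ".join(symbol for _, symbol in expressed[:top_k])


genes = ["ACTB", "CD8A", "GZMB", "PRF1", "MS4A1"]
counts = [0.0, 5.2, 4.1, 3.3, 0.0]
sentence = cell_sentence(counts, genes, top_k=3)
print(sentence)  # CD8A GZMB PRF1
```

In a real pipeline the expression vector would come from a row of the AnnData matrix, and the same function could be applied to a cluster's mean profile instead of a single cell.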

What makes this different from existing tools?
Prior Cell-to-Sentence tools like CeLLama convert cell sentences into embedding vectors and find the nearest known-cell neighbour. That approach is fast but purely classificatory: it tells you what a cell is but not why, cannot flag uncertainty, and cannot describe a cell's functional state. C2S uses Gemma 4's reasoning to explain a cell's phenotype, surface uncertainty explicitly, and ground every biological claim in the Gene Ontology.

Demo

demo video

Code

kaggle notebook

How I Used Gemma 4

I chose Gemma 4 4B MoE (E4B) because the mixture-of-experts architecture gives it a far larger effective knowledge base than a 4B dense model, which matters enormously for biomedical reasoning. Recognising obscure gene symbols, understanding pathway crosstalk, and distinguishing cell states requires breadth that a small dense model simply lacks.
Critically, Gemma 4's <|thought|> structured reasoning was the deciding factor. When a cell sentence is ambiguous, for example, a cluster co-expressing both exhausted and effector T cell markers, Gemma 4 reasons through the tension explicitly before committing to an annotation. This is not possible with embedding-based approaches. The model's reasoning trace also serves as an audit trail, making the annotation scientifically defensible in a way that black-box classification cannot be.
The pipeline feeds each cell sentence as a structured prompt requesting a JSON response containing cell_type, functional_state, active_pathways, and an uncertainty_flag. This output is then parsed and validated against GO terms before being written back to the AnnData object.
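
A rough sketch of the prompt, parse, and validate loop follows. It only assumes the four JSON keys named above. The prompt wording, the helper names (`build_prompt`, `parse_annotation`, `verify_pathways`), and the exact-name matching against a small in-memory set of GO term names are illustrative assumptions; the notebook may query GO differently.

```python
import json
from typing import Iterable

REQUIRED_KEYS = ("cell_type", "functional_state",
                 "active_pathways", "uncertainty_flag")


def build_prompt(cell_sentence: str) -> str:
    """Hypothetical prompt template; the real wording may differ."""
    return (
        "You are annotating a single-cell RNA-seq cluster.\n"
        f"Cell sentence (genes ranked by expression): {cell_sentence}\n"
        "Respond with a JSON object with keys: " + ", ".join(REQUIRED_KEYS) + "."
    )


def parse_annotation(raw: str) -> dict:
    """Parse the span from the first '{' to the last '}' and check the keys."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    annotation = json.loads(raw[start:end + 1])
    missing = [key for key in REQUIRED_KEYS if key not in annotation]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return annotation


def verify_pathways(annotation: dict, known_go_terms: Iterable[str]) -> dict:
    """Split pathways into GO-verified and unverified (case-insensitive match)."""
    known = {term.lower() for term in known_go_terms}
    pathways = annotation["active_pathways"]
    verified = [p for p in pathways if p.lower() in known]
    unverified = [p for p in pathways if p.lower() not in known]
    return {**annotation, "go_verified": verified, "go_unverified": unverified}


# Stand-in model reply and a tiny stand-in GO vocabulary for the example.
reply = ('Reasoning done. {"cell_type": "CD8+ Cytotoxic T Cell", '
         '"functional_state": "Activated / Effector", '
         '"active_pathways": ["T cell receptor signaling pathway", '
         '"Made-up pathway"], "uncertainty_flag": false}')
go_terms = {"T cell receptor signaling pathway",
            "cytokine-mediated signaling pathway"}
result = verify_pathways(parse_annotation(reply), go_terms)
print(result["go_verified"], result["go_unverified"])
```

Taking the span from the first `{` to the last `}` tolerates a reasoning preamble before the JSON, but it would break if the preamble itself contained braces, so a stricter extraction may be needed in practice.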

Results

Against the CellTypist Pan_Immune v2 ground truth on a 70,000-cell PBMC dataset:

  • Gemma 4 Base (zero-shot): ARI=0.266, NMI=0.376, JSON parse rate=100%, GO verify rate=65.9%

  • Gemma 4 Fine-Tuned (QLoRA C2S): Improved GO verify rate=67.2%; ARI/NMI recovering post-fix

  • Top GO-verified pathways: T Cell Receptor Signaling (17 clusters), Cell Cycle Regulation (12), Plasma Cell Differentiation (11)
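
For readers unfamiliar with the two agreement metrics, here is a dependency-free sketch of the standard definitions of the Adjusted Rand Index and Normalized Mutual Information (arithmetic-mean normalization). It is not the notebook's evaluation code, and the toy labels are made up for the example; in practice scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` would be the usual choice.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred) -> float:
    """ARI from the contingency table of two labelings."""
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: trivially identical
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_truth * same_pred / total_pairs
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (same_both - expected) / (max_index - expected)


def entropy(labels) -> float:
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def normalized_mutual_info(truth, pred) -> float:
    """NMI = MI / mean(H(truth), H(pred))."""
    n = len(truth)
    rows, cols = Counter(truth), Counter(pred)
    mutual_info = sum(
        (c / n) * log(c * n / (rows[t] * cols[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )
    denom = (entropy(truth) + entropy(pred)) / 2
    return 1.0 if denom == 0 else mutual_info / denom


# Toy labels: a perfect relabeling and a completely uninformative one.
truth = ["T", "T", "B", "B"]
ari_perfect = adjusted_rand_index(truth, [0, 0, 1, 1])
ari_crossed = adjusted_rand_index(truth, [0, 1, 0, 1])
nmi_perfect = normalized_mutual_info(truth, [0, 0, 1, 1])
nmi_crossed = normalized_mutual_info(truth, [0, 1, 0, 1])
print(ari_perfect, ari_crossed, nmi_perfect, nmi_crossed)
```

Both metrics ignore the label names themselves and only compare how the cells are grouped, which is why they suit a comparison between free-text LLM annotations and a fixed reference vocabulary.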

The confusion matrix in the notebook shows strong recall for monocytes (0.82), DCs (1.00), and the dominant "Other" T-cell blob (0.74). NK and platelet recall failed because of coarse-mapping bugs that have since been resolved.
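
Per-class recall is read off a confusion matrix as the diagonal entry divided by its row total. The sketch below assumes rows are true labels and columns are predicted labels; the helper name `per_class_recall` and the 2x2 toy counts are purely illustrative and are not the notebook's numbers.

```python
from typing import Dict, List, Sequence


def per_class_recall(confusion: Sequence[Sequence[int]],
                     labels: List[str]) -> Dict[str, float]:
    """Recall per class: correct predictions / all cells of that true class.

    Assumes confusion[i][j] counts cells with true label i predicted as j.
    """
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(confusion[i])
        # A class with no true cells has undefined recall.
        recalls[label] = confusion[i][i] / row_total if row_total else float("nan")
    return recalls


# Toy counts for two made-up classes.
toy_matrix = [[8, 2],
              [5, 5]]
recall = per_class_recall(toy_matrix, ["A", "B"])
print(recall)  # {'A': 0.8, 'B': 0.5}
```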

Credit to my teammate Andrew.
