When 10,000 Cells Tell 10,000 Different Stories
Imagine you're trying to understand a bustling city. You could look at it from space and see general patternsโwhere the downtown is, where the suburbs spread out. But to really understand the city, you need to meet its inhabitants: the barista brewing your morning coffee, the doctor saving lives at the hospital, the teacher shaping young minds. Each person has a unique story, a unique purpose.
Your body is that city, and cells are its inhabitants. For decades, scientists studied cells in bulkโlike viewing that city from space. But what if we could listen to each cell individually? What if every cell could tell us its unique story through the genes it's expressing at that very moment?
Welcome to the world of single-cell RNA sequencing (scRNA-seq), where we do exactly that.
๐ต The Power of Listening to Individual Voices
Traditional bulk RNA sequencing is like recording a symphony orchestra but only having one microphone placed in the middle of the concert hall. You hear music, sure, but can you distinguish the violins from the cellos? The flutes from the trumpets?
Single-cell RNA-seq is like giving each musician their own microphone. Suddenly, you can:
- โ Identify who's playing what
- โ Discover that one amazing violinist playing a slightly different melody
- โ Realize the person you thought was a trumpeter is actually playing the saxophone
In our bodies, this means discovering:
- Rare cell types that were invisible in bulk measurements
- Cell states (like cells transitioning from resting to activated)
- Disease mechanisms hidden within specific cell populations
- How cells respond differently to the same treatment
๐ฏ Our Mission: Decoding Human Blood Cells
For this journey, we're analyzing peripheral blood mononuclear cells (PBMCs)โthe immune cells floating in your bloodstream right now. These are your body's defense force: the soldiers, scouts, and command centers of your immune system.
We start with approximately 10,000 cells captured using 10x Genomics technology, which works like this:
Individual Cell โ Droplet โ Barcode โ Sequence โ Digital "Fingerprint"
Each cell gets trapped in its own microscopic droplet with a unique barcode, like giving each person a numbered ticket. When we sequence the RNA, we know exactly which cell each piece of genetic code came from.
๐ The Data We're Working With
๐ Dataset: 10,000+ individual PBMC cells
๐งฌ Genes tracked: ~20,000 per cell
๐ Total measurements: 200+ million data points
๐พ File size: Several hundred megabytes
Think about that for a second: we're measuring the activity of 20,000 genes in 10,000 individual cells. That's like reading 200 million different sentences to understand the story of your immune system!
๐ Full Tutorial: The complete Jupyter notebook with all code and visualizations is available on GitHub. Follow along and run the analysis yourself!
Part 1: The Setup - Building Our Laboratory in the Cloud โ๏ธ
Since we're working with massive datasets, we need a computational laboratory. We're using Google Colab, which is essentially a computer in the cloud that we can access from anywhere.
Why Google Drive Integration Matters
from google.colab import drive
drive.mount('/content/drive')
This simple act of "mounting" Google Drive is like claiming a permanent workspace. Without it, every time our coding session ends (Colab disconnects after a few hours of inactivity), we'd lose everythingโinstalled packages, downloaded data, analysis results. It's like being a scientist whose lab vanishes every evening and reappears empty the next morning!
๐ง Our Toolkit: The Scientific Instruments
We're assembling quite the collection:
install_if_missing("scanpy") # The Swiss Army knife of scRNA-seq
install_if_missing("anndata") # Our data container
install_if_missing("igraph") # For finding communities
install_if_missing("celltypist") # AI-powered cell identification
install_if_missing("decoupler") # Pattern recognition engine
Each library is like a specialized instrument:
- Scanpy: Your all-in-one analysis platform (like a high-powered microscope)
- AnnData: The filing system that keeps everything organized
- igraph: A social network analyzer for cells (who hangs out with whom?)
- Celltypist: An AI that learned what different cell types look like
- Decoupler: A pattern matcher trained on thousands of experiments
Scanpy โ Single-Cell Analysis in Python
Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing The Python-based implementation efficiently deals with datasets of more than one million cells. For datasets too large to fit into memory, many scanpy functions are now compatible with dask (warning: experimental).
Discuss usage on the scverse Discourse. Read the documentation. If you'd like to contribute by opening an issue or creating a pull request, please take a look at our contribution guide.
scanpy is part of the scverseยฎ project (website, governance) and is fiscally sponsored by NumFOCUS If you like scverseยฎ and want to support our mission, please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a varietyโฆ
Part 2: Quality Control - Separating Signal from Noise ๐
Not every cell we capture is high quality. Some are damaged during collection, some are actually two cells stuck together (doublets), and some are dying as we speak. We need to filter these out.
The Three Red Flags ๐ฉ
1. Mitochondrial Genes (The Dying Cell Signature)
fresh_blood_adata.var['MT'] = fresh_blood_adata.var_names.str.startswith("MT-")
Why this matters: Imagine a sinking ship. The hull breaches, and water rushes in. The cargo floats away, but the heavy machinery stays on board. That's what happens to dying cells:
- The cell membrane becomes leaky (the hull breaches)
- Cytoplasmic RNA escapes (cargo floats away)
- Mitochondrial RNA stays trapped inside (heavy machinery)
If more than 10-20% of a cell's RNA is mitochondrial, it's probably dying. We mark these cells for exclusion.
Healthy Cell:
๐ต๐ต๐ต๐ต๐ต๐ต๐ต๐ต๐ต๐ด (10% MT = good)
Dying Cell:
๐ต๐ต๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด (80% MT = bad)
2. Ribosomal Genes (The House Cleaning Noise)
Ribosomal proteins are like the cleaning staff of the cellโessential but not particularly informative about what makes a T cell different from a B cell. Too much ribosomal RNA can drown out the interesting biological signals.
Why this matters: Imagine you're trying to identify different professions by listening to what people talk about:
- Doctor: "surgery," "diagnosis," "patient care" โ (distinctive)
- Teacher: "curriculum," "students," "lesson plans" โ (distinctive)
- Everyone: "commute," "lunch," "email," "meetings" โ (not distinctive)
Ribosomal genes are the cellular equivalent of "everyone talks about this." They're housekeeping genes involved in protein synthesisโabsolutely critical for cell survival, but expressed at high levels in ALL cells regardless of type.
The ribosomal noise problem:
Gene Expression Profile:
Biological Signal Genes: Ribosomal Genes:
CD3D, CD4, CD8 (T cell markers) RPL/RPS genes (protein machinery)
๐ PROBLEMATIC CELL (High Ribo):
Ribosomal: โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ 70%
T markers: โโโโโโ 20%
Other: โโโ 10%
โ Signal buried in noise!
๐ CLEAN CELL (Low Ribo):
T markers: โโโโโโโโโโโโโโโโโโโโ 60%
Other: โโโโโโโโโโโ 30%
Ribosomal: โโโ 10%
โ Signal clearly visible!
Common ribosomal genes we filter:
RPL3, RPL4, RPL5... (Large ribosomal subunit proteins)
RPS2, RPS3, RPS4... (Small ribosomal subunit proteins)
These can represent 30-50% of total mRNA in some cells!
Visual comparison:
Cell Type Distinction WITHOUT Ribo Filtering:
Expression Level
โ
High โ โโโโ โโโโ โโโโ โ All cells look similar
โ โโโโ โโโโ โโโโ (dominated by ribosomal signal)
Low โ โโโโ โโโโ โโโโ
โโโโโโโโโโโโโโโโโ
T B NK
Cell Type Distinction WITH Ribo Filtering:
Expression Level
โ
High โ โโโโ โ Clear differences!
โ โโโโ โโโโ (distinctive markers visible)
Low โ โโโโ โโโโ โโโโ
โโโโโโโโโโโโโโโโโ
T B NK
CD3 CD79 NKG7
The filtering strategy:
# Mark ribosomal genes
fresh_blood_adata.var['RIBO'] = fresh_blood_adata.var_names.str.startswith(('RPS', 'RPL'))
# Remove cells with >2% ribosomal content
adata = adata[adata.obs['pct_counts_RIBO'] < 2]
Why 2% threshold?
Distribution of Ribosomal RNA Content:
Cellsโ Normal Technical artifacts
โ โฑโฒ โฑโฒ
โ โฑ โฒ โฑ โฒ
โ โฑ โฒ______โฑ โฒ___
โ____โฑ โฒ____
0% 1% 2% 3% 4% 5% โ
โ
2% cutoff
Keep โ Remove โ
Most healthy cells: 0.5-1.5% ribosomal RNA
Problematic cells: >2% (technical issues during capture)
3. Hemoglobin Genes (The Contamination Signal)
In blood samples, red blood cells can break open and spill their hemoglobin everywhere. It's like trying to study fish in an aquarium but the water is cloudy with algae. We need to filter this contamination.
Why this matters: Red blood cells (erythrocytes) are the most abundant cells in bloodโabout 5 million per microliter! But they're not immune cells and shouldn't be in our PBMC sample. When they lyse (break open) during sample processing, they release massive amounts of hemoglobin that can contaminate other cells.
The contamination cascade:
DURING SAMPLE PROCESSING:
Step 1: Blood draw
๐ด๐ด๐ด๐ด๐ด๐ด๐ด โ Millions of red blood cells
๐ต ๐ข ๐ก โ PBMCs (what we want)
Step 2: Density gradient separation
โโโโโโโโโโโโโ
๐ต ๐ข ๐ก โ PBMCs float (good!)
โโโโโโโโโโโโโ
๐ด๐ด๐ด๐ด๐ด โ RBCs sink (good!)
Step 3: Collection (sometimes incomplete)
๐ต ๐ข ๐ก ๐ฅ๐ด โ Some RBCs break open
โ
Hemoglobin spills out!
Step 4: Hemoglobin contamination
๐ต๐ง ๐ข๐ง ๐ก โ PBMCs coated with hemoglobin
โ
HBA, HBB genes detected in immune cells (BAD!)
Hemoglobin gene expression pattern:
๐ CONTAMINATED CELL:
Hemoglobin: โโโโโโโโโโโโโโโโโโโโโโ 50% โ Red flag!
(HBA1, HBA2, HBB, HBD)
Immune: โโโโ 10%
Other: โโโโโโโโ 40%
๐ CLEAN IMMUNE CELL:
Hemoglobin: โ <0.1% โ Should be near zero!
Immune: โโโโโโโโโโโโโโโโ 70%
Other: โโโโโโ 30%
Common hemoglobin genes:
HBA1 โโ
HBA2 โโค Alpha globin chains
โ (Should NOT be in lymphocytes!)
HBB โโค
HBD โโ Beta globin chains
The detection pattern:
UMAP Plot - Before Hemoglobin Filtering:
โ ๐ต๐ต๐ต โ T cells (good)
โ ๐ต๐ต
โ
โ ๐ข๐ข โ B cells (good)
โ ๐ข๐ข
โ
โ โโโ โ Mystery cluster
โ โโโ (HIGH HBA/HBB!)
โ โโโ
โโโโโโโโโโโโโโโโโโโโโ
Gene Expression in Mystery Cluster:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Gene | % Expressed | Avg Exp โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฃ
โ HBA1 | 95% | 9.2 โ ๐จ
โ HBA2 | 92% | 8.8 โ ๐จ
โ HBB | 89% | 8.1 โ ๐จ
โ CD3D | 1% | 0.1 โ (no T markers)
โ CD79A | 2% | 0.1 โ (no B markers)
โ CD14 | 3% | 0.2 โ (no myeloid)
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Conclusion: Red blood cell contamination or ambient RNA!
Ambient RNA problem:
What is "ambient RNA"?
During scRNA-seq:
Droplet with healthy cell:
โโโโโโโโโโโโโโโ
โ ๐ต โ โ Immune cell
โ โ
โ Clean! โ
โโโโโโโโโโโโโโโ
Droplet with contamination:
โโโโโโโโโโโโโโโ
โ ๐ต ๐ง๐ง โ โ Cell + hemoglobin
โ ๐ง๐ง โ floating in solution
โ Problem! โ
โโโโโโโโโโโโโโโ
The hemoglobin RNA gets captured along with the cell's RNA!
Filtering strategy:
# Mark hemoglobin genes
fresh_blood_adata.var['HB'] = fresh_blood_adata.var_names.str.contains('^HB[^(P)]')
# Remove cells with >0.1% hemoglobin
adata = adata[adata.obs['pct_counts_HB'] < 0.1]
Why 0.1% threshold?
Hemoglobin Content Distribution:
Cellsโ Clean cells Contaminated
โ
โ โฑโฒ
โโฑ โฒ โ โ
โ โฒ___ โ โ โ
โ โฒ_________โ______โ___
0% 0.05% 0.1% 0.5% 1% 5% โ
โ
0.1% cutoff
Keep โ Remove โ
Clean immune cells: <0.05% hemoglobin
Contaminated cells: >0.1% hemoglobin
Pure RBC debris: >1% hemoglobin
Real-world impact:
CASE STUDY: The Phantom Cluster
Initial Analysis (no HB filtering):
15 clusters detected
Cluster 12: ~400 cells
Cluster 12 characteristics:
โ No T cell markers
โ No B cell markers
โ No NK markers
โ No monocyte markers
โ High HBA1, HBA2, HBB
Gene Ontology Enrichment:
โ "oxygen transport"
โ "hemoglobin complex"
โ "erythrocyte differentiation"
Conclusion: Not a real immune cell type!
After HB filtering:
14 clusters detected โ
Cluster 12: REMOVED
Clean analysis of actual immune cells!
The triple filter visualization:
Quality Control Cascade:
START: 10,000 cells
โ
โโ Filter MT > 20% โโโโโ Remove 800 dying cells
โ
โโ Filter RIBO > 2% โโโโ Remove 500 noisy cells
โ
โโ Filter HB > 0.1% โโโโ Remove 400 contaminated cells
โ
โ
END: 8,300 high-quality cells โ
Cell Quality Score Formula:
Q = -wโ(MT%) - wโ(RIBO%) - wโ(HB%) + wโ(n_genes)
Where:
wโ = 2.0 (MT% most important - indicates death)
wโ = 1.0 (RIBO% moderate - indicates noise)
wโ = 1.5 (HB% high importance - indicates contamination)
wโ = 0.5 (n_genes - more genes = better)
Biological interpretation:
Why each filter matters for cell type identification:
WITHOUT FILTERS: WITH FILTERS:
"T cell" cluster True T cell cluster
Mixed signals: Clear signal:
โโ CD3D (real T marker) โโ CD3D โ
โโ HBA1 (contamination) โโ CD3E โ
โโ RPL3 (noise) โโ CD4 โ
โโ MT-CO1 (dying cells) โโ IL7R โ
Confusing! โ Perfect! โ
The Quality Control Thresholds: Finding the Sweet Spot
After identifying these red flags, we need to decide where to draw the line. This is like airport security deciding how suspicious someone needs to be before additional screening.
# Set our quality control thresholds
fresh_blood_adata = fresh_blood_adata[
(fresh_blood_adata.obs['pct_counts_MT'] < 20) &
(fresh_blood_adata.obs['pct_counts_RIBO'] < 2) &
(fresh_blood_adata.obs['pct_counts_HB'] < 0.1) &
(fresh_blood_adata.obs['n_genes_by_counts'] > 200) &
(fresh_blood_adata.obs['n_genes_by_counts'] < 6000)
]
Visual representation of our filtering criteria:
CELL QUALITY SPECTRUM
โ TOO LOW โ
JUST RIGHT โ TOO HIGH
<200 genes 200-6000 genes detected >6000 genes
(empty/broken) (healthy cells) (doublets)
โ โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ โ
Mitochondrial RNA %:
โ >20% โ
<20%
(dying cell) (healthy cell)
๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด๐ด ๐ต๐ต๐ต๐ต๐ต๐ต๐ต๐ต๐ด๐ด
(80% MT-bad) (20% MT-good)
The Filtering Impact: Before and After
Here's what happens to our dataset:
๐ BEFORE FILTERING:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Total cells: 10,000 โ
โ Mean genes/cell: 2,847 โ
โ Median MT%: 4.2% โ
โ Suspected doublets: ~500 โ
โ Low quality: ~800 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ APPLY FILTERS โ
๐ AFTER FILTERING:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Total cells: 8,200 โ โ
โ Mean genes/cell: 3,124 โ โ
โ Median MT%: 3.1% โ โ
โ Clean, high-quality data! ๐ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
What we removed:
- ~500 doublets (two cells masquerading as one)
- ~800 low-quality/dying cells (high MT%, low gene counts)
- ~500 contaminated cells (high hemoglobin)
Why this matters: Garbage in, garbage out. If we include dying cells in our T cell cluster, we might falsely conclude that T cells express death-associated genes when they actually don't!
Interactive Visualization: The Filtering Dashboard
Imagine looking at a scatter plot where each point is a cell:
Mitochondrial % vs Genes Detected
MT%
30โ โ โ dying cells
โ โ โ
25โ โ โ โ
โ โ โ
20โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ Filter threshold
โ โโโโโโโโโโโ
15โ โโโโโโโโโโโโโ
โ โโโโโโโโโโโโโโ โ healthy cells
10โโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโ
5โโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโ
0โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ Genes detected
0 1k 2k 3k 4k 5k 6k 7k
โ
doublets โ
The beauty of this scatter plot:
- Bottom left: Empty droplets/debris (low genes, low MT)
- Middle band: Healthy cells (moderate genes, low MT) โ
- Top right: Dying cells (variable genes, high MT)
- Far right: Doublets (too many genes)
Gene-Level Filtering: Removing Uninformative Genes
We also filter genes, not just cells:
# Remove genes detected in fewer than 3 cells
sc.pp.filter_genes(fresh_blood_adata, min_cells=3)
Why remove rare genes?
Gene X detected in 2 cells out of 8,200:
Scenario 1: Technical noise
Cell_0042: "X" โ Sequencing error
Cell_5891: "X" โ Random spurious detection
Scenario 2: Extremely rare (can't analyze statistically)
Even if real, 2 cells is insufficient for robust analysis
The filtering cascade:
GENE FILTERING CASCADE
Start: 20,000 genes
โ
Remove genes in <3 cells
โ (remove ~5,000 rare/noise genes)
โ
15,000 genes remain
โ
Keep highly variable genes (later step)
โ (select ~2,000-3,000 informative genes)
โ
Final: 2,000 genes for analysis
This is like going from every word in the dictionary
to just the keywords that define each cell type!
The Scrublet Score Distribution
When we run doublet detection, we get a score for each cell:
Doublet Score Distribution
Cellsโ Real cells Doublets
โ โฑโฒ โฑโฒ
โ โฑ โฒ โฑ โฒ
โ โฑ โฒ___________โฑ โฒ
โ โฑ โฒ
โ____โฑ_________________________โฒ___
0.0 0.1 0.2 0.3 0.4 0.5 0.6
Doublet Score โ
โโโโโโโโโโโคโโโโโโโโโโโโโโโโโโโโค
Keep โ Remove (threshold) โ
Reading this plot:
- Left peak: Single cells (low doublet scores) โ
- Right peak/tail: Likely doublets (high scores) โ
- Threshold (typically 0.25-0.30): Decision boundary
The Complete Quality Control Workflow Visualized
QC WORKFLOW
๐ฅ Raw Data
10,000 cells
โ
โโโโ [Calculate QC metrics]
โ โข Count genes per cell
โ โข Calculate MT%
โ โข Calculate RIBO%
โ โข Calculate HB%
โ
โโโโ [Detect doublets]
โ โข Run Scrublet
โ โข Flag suspicious cells
โ
โโโโ [Visualize distributions]
โ โข Violin plots
โ โข Scatter plots
โ โข Histograms
โ
โโโโ [Set thresholds]
โ โข MT% < 20%
โ โข Genes: 200-6000
โ โข RIBO% < 2%
โ โข HB% < 0.1%
โ โข Doublet score < 0.25
โ
โโโโ [Apply filters]
โ โข Remove 1,800 cells
โ
โ
โ
Clean Data
8,200 high-quality cells
โ
โ
[Continue to normalization...]
Real Example: The Contamination Story
Here's a real scenario from our analysis:
CLUSTER 7: The Mystery Cluster
Initial observation:
โข 450 cells
โข High hemoglobin genes (HBA1, HBA2, HBB)
โข Low immune markers
โข Suspiciously uniform
Investigation:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Gene | % Expressed | Avg Exp โ
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโฃ
โ HBA1 | 92% | 8.7 โ โ Red flag!
โ HBA2 | 89% | 8.2 โ โ Red flag!
โ HBB | 87% | 7.9 โ โ Red flag!
โ CD3D | 2% | 0.1 โ โ Should be higher
โ CD79A | 1% | 0.0 โ โ Should be higher
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Conclusion: Red blood cell contamination!
Action: Filter out this cluster entirely
Result: Clean immune cell analysis โ
The Mathematical Beauty of QC
Quality control involves elegant statistical reasoning:
For each cell, we calculate a quality score:
Q = wโ(genes_normalized) - wโ(MT_pct) - wโ(doublet_score)
Where:
genes_normalized = (genes - min) / (max - min)
MT_pct = mitochondrial percentage
doublet_score = Scrublet output
wโ, wโ, wโ = weights (importance of each factor)
High Q score โ Keep cell โ
Low Q score โ Remove cell โ
Visual representation:
Quality Score Distribution:
โ Keep these โ โฑโฒ
Cellsโ โฑ โฒ
โ โฑ โฒ___
โ Remove these โ โฑ โฒ
โ___________________โฑ____________โฒ____
Low High
Quality Score โ
๐ Visualizing Quality Metrics
Here's what our quality control plots reveal:
Number of Genes Detected
๐ป
โ โฑ โฒ
โ โฑ โฒ
Cellsโ โฑ โฒ___
โ โฑ โฒ___
โโโโโโโโโโโโโโโโโโโโโโโ
0 2000 4000 6000
This violin plot shows:
- โ Most cells: 2,000-4,000 genes detected
- โ Low-quality cells: <1,000 genes detected
- โ ๏ธ Possible doublets: >6,000 genes detected
๐ญ Doublet Detection: Finding Cellular Impostors
sc.pp.scrublet(fresh_blood_adata)
Doublets are when two cells end up in the same droplet. It's like conducting interviews but accidentally interviewing two people at onceโtheir combined answers make no sense!
How Scrublet works:
- Simulates artificial doublets by merging random pairs of real cells
- Compares each real cell to these fake doublets
- Flags cells that look suspiciously similar to the fakes
Cell A + Cell B = Doublet
๐ต + ๐ข = ๐ต๐ข
(T cell) (B cell) (Fake "hybrid" cell)
Part 3: Normalization - Making Apples Comparable to Apples ๐
Different cells captured different amounts of RNAโnot because they're biologically different, but due to random technical variation. Some droplets caught "talkative" cells, others caught "quiet" ones.
The problem: Cell A might have 50,000 RNA molecules, Cell B only 5,000. Does this mean Cell A is more active? Not necessarily! Cell A might have just been captured more efficiently.
The solution: Normalize!
# Scale each cell to the same total count
sc.pp.normalize_total(fresh_blood_adata)
# Log-transform to stabilize variance
sc.pp.log1p(fresh_blood_adata)
This is like adjusting volume levels so every musician can be heard equally. Now we can compare their actual melodies, not just their loudness.
Why Log Transformation?
Gene expression follows a power law distributionโa few genes are massively expressed, most are moderate or low. Log transformation compresses this range so we can see patterns in both highly and lowly expressed genes.
Before log:
Gene A: โ
Gene B: โ
Gene C: โโโโโโโโโโโโโโโโโโโโ (drowns out everything!)
After log:
Gene A: โ
Gene B: โโ
Gene C: โโโโ (all visible!)
Part 4: Feature Selection - Finding the Genes That Matter ๐ฏ
Out of 20,000 genes, which ones actually help us distinguish cell types?
Most genes are "housekeeping genes"โgenes that every cell needs to survive, like the genes for making ribosomes or processing energy. These genes are like everyone in the city wearing pantsโsure, it's universal, but it doesn't tell you whether someone is a doctor, teacher, or artist.
We want the distinctive genesโthe ones that make cell types unique:
- CD3D, CD3E โ T cell markers (like a doctor's stethoscope)
- CD79A, CD79B โ B cell markers (like a teacher's chalk)
- NKG7, GNLY โ NK cell markers (like a soldier's uniform)
sc.pp.highly_variable_genes(fresh_blood_adata, n_top_genes=1000)
This selects the top 1,000 most variable genesโgenes that change dramatically between cells, indicating biological differences rather than random noise.
๐ Highly Variable Genes Plot:
Dispersion
โ
โ โ โ Selected genes
โ โ โโ โ (high variance)
โ โ โ โ
โ โ โ
โโโโโโโผโโโโโโโโโ Mean Expression
โ โ โ โ Not selected
โโ โ (low variance)
Part 5: Dimensionality Reduction - From 1,000 Dimensions to 2D ๐
We now have 1,000 genes per cell. That's 1,000 dimensionsโimpossible to visualize! Imagine trying to draw a 1,000-dimensional graph. Your brain would melt.
Enter PCA (Principal Component Analysis)
PCA is like finding the best camera angles to photograph a complex 3D sculpture. Instead of taking 1,000 photos from random angles, we find the 30-50 angles that show the most interesting features.
sc.tl.pca(fresh_blood_adata)
How PCA works (simplified):
- Find the direction with the most variation (PC1)
- Find the next direction with the most variation perpendicular to PC1 (PC2)
- Continue for 50 components
Original 1,000 dimensions:
๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ... (ร100)
Compressed to 50 PCs:
๐ฏ๐ฏ๐ฏ๐ฏ๐ฏ (captures 80%+ of the information!)
UMAP: The Final Visualization
PCA gives us 50 dimensionsโbetter, but still not visualizable. Enter UMAP (Uniform Manifold Approximation and Projection):
sc.tl.umap(fresh_blood_adata)
UMAP is like taking a complex network map of a city with all its roads, buildings, and connections, and drawing a simplified 2D map that preserves the important relationships:
- Neighborhoods stay together
- Connected areas remain connected
- Distances approximately preserved
UMAP Plot
โญโโโโโโโโโโโโโโโโโฎ
โ ๐ต๐ต๐ต โ
โ ๐ต๐ต โ
โ โ
โ ๐ข๐ข๐ข โ
โ ๐ข๐ข โ
โ โ
โ ๐ด๐ด๐ด โ
โ ๐ด๐ด โ
โฐโโโโโโโโโโโโโโโโโฏ
๐ต = T cells
๐ข = B cells
๐ด = NK cells
Part 6: Clustering - Discovering Cell Types ๐ฌ
Now we can see clear islands of cells in our UMAP plot. But which island is which cell type?
The Leiden Algorithm: Finding Communities
sc.tl.leiden(fresh_blood_adata, resolution=0.5)
Think of your cells as people at a conference. The Leiden algorithm is like observing who talks to whom:
- People cluster into groups (research fields)
- Similar people group together naturally
- Some people bridge multiple groups
How it works:
- Build a network where similar cells are connected
- Find communities where connections are dense internally but sparse externally
- Iteratively optimize these communities
Network Graph:
T cells: B cells:
โโโโโ โโโ
โ โ โ โ โ
โโโโโ โโโ
โ (weak connection)
NK cells:
โโโโโ
โ โ โ
โโโโโ
๐ Resolution: Choosing the Right Magnification
The resolution parameter is like choosing how close to zoom in:
# Low resolution (0.02): See major cell lineages
sc.tl.leiden(adata, resolution=0.02)
# โ 3-5 clusters (T cells, B cells, Myeloid)
# Medium resolution (0.5): See cell types
sc.tl.leiden(adata, resolution=0.5)
# โ 10-15 clusters (CD4+ T, CD8+ T, Memory B, Naive B...)
# High resolution (2.0): See cell subtypes
sc.tl.leiden(adata, resolution=2.0)
# โ 30+ clusters (Th1, Th2, Th17, Tregs...)
It's like looking at a forest:
- ๐ฒ Far away: You see the forest (low resolution)
- ๐ณ Closer: You see individual tree species (medium resolution)
- ๐ Very close: You see oak subspecies and age classes (high resolution)
Part 7: Cell Type Annotation - Putting Names to Faces ๐ท๏ธ
We have clusters, but what ARE they? This is where Decoupler comes in.
Decoupler + PanglaoDB: The Cell Type Detective
markers = dc.op.resource(name="PanglaoDB", organism="human")
dc.mt.ulm(data=fresh_blood_adata, net=markers, tmin=3)
PanglaoDB is a database containing thousands of known cell type markers from published studies. It's like having a library of "wanted posters" for different cell types:
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ B CELL MARKERS โ
โ - CD19 - CD79A โ
โ - CD79B - MS4A1 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ T CELL MARKERS โ
โ - CD3D - CD3E โ
โ - CD3G - CD4 โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
How Decoupler annotates cells:
- For each cell, check how many B cell markers it expresses
- Check how many T cell markers it expresses
- Check all other cell types
- Assign the cell to the type with the highest score
Cell X:
B cell score: 0.2 โญ
T cell score: 8.7 โญโญโญโญโญโญโญโญ
NK cell score: 0.1 โญ
โ Cell X is a T CELL! ๐ฏ
Statistical Ranking
We go one step further by statistically testing which cell type signature is most enriched in each cluster:
fresh_blood_adata_rank = dc.tl.rankby_group(
score,
groupby="leiden_res0_02",
reference="rest",
method="t-test_overestim_var"
)
This is like asking: "For each cluster, which cell type signature is significantly higher here than everywhere else?"
Part 8: Validation - Trust, But Verify โ
Automated annotation is powerful, but we still need to validate with known marker genes. Let's visualize!
The Dot Plot: Expression at a Glance
sc.pl.dotplot(fresh_blood_adata, marker_genes_dict, "cell_type")
A dot plot shows two things:
- Dot size: What percentage of cells express this gene?
- Dot color: How strongly is it expressed?
CD3D CD79A NKG7
T cells โโโ ยท ยท
B cells ยท โโโ ยท
NK cells ยท ยท โโโ
โ = large, red dot (high expression)
ยท = tiny dot (low/no expression)
If our annotation is correct, we should see:
- โ T cell markers HIGH in T cells, LOW elsewhere
- โ B cell markers HIGH in B cells, LOW elsewhere
- โ NK cell markers HIGH in NK cells, LOW elsewhere
The Violin Plot: Distribution Stories
sc.pl.violin(score, keys=["B cells memory"], groupby="leiden_res0_02")
Violin plots show the full distribution of expression levels:
Cluster 0: ๐ป (wide, high median โ B cells!)
Cluster 1: ๐ต (narrow, low โ not B cells)
Cluster 2: ๐ต (narrow, low โ not B cells)
The Heatmap: The Big Picture
sc.pl.heatmap(fresh_blood_adata, marker_genes_dict, groupby="leiden_res0_02")
A heatmap shows everything at once:
CD3D CD3E CD79A MS4A1 NKG7 GNLY
Cluster 0: ๐ฆ๐ฆ ๐จ๐จ ๐ฆ๐ฆ (T cells)
Cluster 1: ๐ฆ๐ฆ ๐จ๐จ ๐ฆ๐ฆ (B cells)
Cluster 2: ๐ฆ๐ฆ ๐ฆ๐ฆ ๐จ๐จ (NK cells)
๐จ = high expression
๐ฆ = low expression
Each row should show a distinct pattern, like a unique barcode for each cell type.
๐งฌ The Biology Behind the Code: What We Discovered
After all this analysis, what did we actually learn about these blood cells?
The Major Players in Your Immune System
๐ต T Cells (CD3+): Your adaptive immune system's quarterbacks
- Coordinate immune responses
- Kill infected cells
- Remember past infections
- Represent ~70% of lymphocytes in blood
๐ข B Cells (CD19+, CD79A+): Your antibody factories
- Produce immunoglobulins
- Create immunological memory
- Present antigens to T cells
- ~10-15% of lymphocytes
๐ก NK Cells (CD56+, NKG7+): Your innate immune system's assassins
- Kill virus-infected and tumor cells
- Don't need prior exposure
- Rapid response team
- ~5-15% of lymphocytes
๐ด Monocytes (CD14+, CD16+): Your mobile macrophages
- Engulf pathogens and debris
- Differentiate into macrophages and dendritic cells
- Present antigens
- ~2-10% of white blood cells
Subtypes and Complexity
Within each major type, we find subtypes:
T Cells:
- CD4+ Helper T cells (orchestrate immune response)
- CD8+ Killer T cells (destroy infected cells)
- Regulatory T cells (prevent autoimmunity)
- Memory T cells (remember past infections)
B Cells:
- Naive B cells (haven't encountered antigen yet)
- Memory B cells (remember past infections)
- Plasma cells (antibody-producing factories)
This heterogeneity is exactly WHY single-cell analysis is revolutionaryโwe'd miss all this complexity in bulk sequencing!
๐ป The Code That Makes It Happen
Let me show you the elegant simplicity of the core workflow:
# 1. Load and QC
adata = sc.read_10x_h5(filename)
sc.pp.calculate_qc_metrics(adata, qc_vars=["MT", "RIBO", "HB"], inplace=True)
sc.pp.filter_cells(adata, min_genes=1000)
sc.pp.filter_genes(adata, min_cells=1000)
# 2. Normalize
adata.layers["counts"] = adata.X.copy() # Save raw counts
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
# 3. Feature selection
sc.pp.highly_variable_genes(adata, n_top_genes=1000)
# 4. Dimensionality reduction
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
# 5. Clustering
sc.tl.leiden(adata, resolution=0.5)
# 6. Cell type annotation
markers = dc.op.resource(name="PanglaoDB", organism="human")
dc.mt.ulm(data=adata, net=markers)
# 7. Visualize
sc.pl.umap(adata, color=["leiden", "cell_type"])
sc.pl.dotplot(adata, marker_genes, "cell_type")
That's it! About 15 lines of code to go from raw data to annotated cell types. Of course, the devil is in the detailsโchoosing the right parameters, validating results, interpreting biologyโbut the core workflow is remarkably elegant.
๐ Get the Full Code: Check out the complete Jupyter notebook on GitHub with step-by-step explanations and all visualizations!
๐ Real-World Applications: Why This Matters
1. Cancer Immunotherapy ๐๏ธ
Single-cell analysis revealed why some patients respond to immunotherapy and others don't:
- Responders have more "exhausted" T cells that can be reactivated
- Non-responders lack these cells or have immunosuppressive cells
- Led to combination therapies targeting specific cell states
2. COVID-19 Severity ๐ฆ
During the pandemic, scRNA-seq showed:
- Severe cases had dysregulated interferon responses
- Certain monocyte subtypes predicted worse outcomes
- Helped identify therapeutic targets
3. Autoimmune Diseases ๐
Discovered that:
- Regulatory T cells are dysfunctional, not absent, in many autoimmune conditions
- Different autoimmune diseases have distinct cellular signatures
- Opened new avenues for targeted therapy
4. Developmental Biology ๐ถ
Mapping how embryonic stem cells become specialized:
- Identified intermediate cell states
- Revealed decision points in cell fate
- Guided strategies for lab-grown tissues
๐ The Future: Where Are We Headed?
Multi-Omics Integration
Imagine measuring not just RNA, but also:
- Proteins (what's actually doing the work)
- Chromatin accessibility (which genes CAN be turned on)
- DNA methylation (epigenetic memory)
- TCR/BCR sequences (immune receptor repertoire)
All in the SAME cell. This is happening now!
Spatial Transcriptomics
The next frontier: measuring gene expression while preserving WHERE each cell sits in the tissue:
Traditional scRNA-seq Spatial
โโโโโโ ๐ต๐ต๐ข๐ต๐ต
โโโโโโ ๐ต๐ต๐ข๐ต๐ต
(lost spatial info) ๐ข๐ข๐ข๐ข๐ข
๐ด๐ด๐ด๐ด๐ด
(preserved positions!)
This reveals:
- Cell-cell communication networks
- Tissue architecture
- Microenvironments
- Gradients and zones
AI-Powered Analysis
Machine learning models are being trained on millions of cells to:
- Automatically annotate any cell from any tissue
- Predict drug responses
- Identify disease signatures
- Generate hypotheses
Real-Time Clinical Diagnostics
The dream: rapid scRNA-seq in hospitals:
- Patient presents with immune disorder
- Draw blood
- scRNA-seq analysis in 24 hours
- Precise diagnosis and personalized treatment plan
๐ค The Philosophy: Why Small Things Matter
There's something profound about single-cell analysis. For centuries, we studied populations, averages, bulks. We asked "What does THE cell do?" as if there was one representative of each type.
Single-cell analysis taught us humility. There is no "THE cell." Every cell is an individual with its own history, state, and destiny. The T cell next to its sibling might be:
- Responding to infection while its neighbor rests
- Expressing different genes despite identical DNA
- On a path to memory while its twin approaches exhaustion
This is true for cells, and perhaps, it's true for life. We are not averages. We are distributions. We are heterogeneous. And that heterogeneity is not noiseโit IS the biology.
When we measure 10,000 cells individually, we're not just generating data. We're acknowledging that:
- Rare cells matter (the 0.1% might save your life)
- Transitions matter (cells in-between states reveal mechanisms)
- Variation matters (diversity enables adaptation)
๐ Getting Started: Your Own Single-Cell Journey
Want to try this yourself? Here's your roadmap:
1. Learn Python Basics (2-4 weeks)
- Variables, loops, functions
- Pandas for data manipulation
- Matplotlib for visualization
2. Understand the Biology (ongoing)
- Cell biology fundamentals
- Immunology basics (if analyzing immune cells)
- Read reviews on scRNA-seq
3. Work Through Tutorials (4-8 weeks)
- Scanpy tutorials
- Seurat tutorials (R alternative)
- Published workflows
4. Analyze Public Data (2-3 months)
- 10x Genomics public datasets
- Human Cell Atlas
- GEO database
5. Design Your Own Experiment (โ)
- Define biological question
- Plan experimental design
- Collaborate with experts
๐ Resources
Essential Reading:
- "Current best practices in single-cell RNA-seq analysis" (Luecken & Theis, 2019)
- Scanpy documentation
- Single Cell Best Practices online book
Datasets to Practice:
Communities:
- Biostars forum
- r/bioinformatics
- Single-cell analysis Slack/Discord channels
๐ผ Final Thoughts: The Symphony of Life
When I first looked at a UMAP plot of 10,000 cells, each dot representing an individual cell, I was struck by the beauty of it. These aren't just data points. Each one is a living entity with:
- Billions of proteins working in concert
- Thousands of molecular machines running simultaneously
- A complete copy of your genome deciding what to become
- A past (where it came from) and future (what it will become)
And we can listen to each one's unique molecular signatureโthe genes it chooses to express at this moment in time.
It's like conducting a 10,000-piece orchestra where:
- Every musician plays their own score
- The music changes moment by moment
- Patterns emerge from seeming chaos
- And somehow, together, they create the symphony of life
That's what single-cell RNA sequencing lets us do: listen to the individual voices while appreciating the collective harmony.
Your immune systemโright now, as you read thisโcontains billions of these cells. Each one performing its role, some fighting infections you'll never know you had, some remembering that cold from three years ago, some just quietly waiting for their moment to act.
We're just beginning to learn their stories. And with every cell we sequence, every algorithm we refine, every biological insight we gain, we get a little closer to understanding the fundamental question:
What does it mean to be alive, one cell at a time?
๐ Acknowledgments & Further Exploration
This analysis was performed using:
- HackBio Intership Analysis
- Scanpy (Wolf et al., 2018)
- AnnData data structure
- Decoupler (Badia-i-Mompel et al., 2022)
- PanglaoDB marker database
- 10x Genomics technology and datasets
๐ Access the Full Tutorial
The complete Jupyter notebook with all code, visualizations, and detailed explanations is available on GitHub:
View the Human Diseased PBMC Notebook on GitHub
What's included:
- โ Complete analysis pipeline
- โ All code cells with explanations
- โ Quality control visualizations
- โ Clustering and annotation workflows
- โ Interactive plots and heatmaps
๐ Connect & Learn
- ๐ GitHub Repository
- ๐ฌ Questions? Drop a comment below!
- ๐ Follow for more bioinformatics tutorials
- ๐ง Reach out for collaborations
"In the world of single-cell analysis, every cell has a story. Our job is to listen."
Top comments (1)
Great post!