Awosise Oluwaseun

Posted on Dec 27, 2025

Decoding Life One Cell at a Time: A Journey Through Single-Cell RNA Sequencing

#python #bioinformatics #datascience #tutorial

When 10,000 Cells Tell 10,000 Different Stories

Imagine you're trying to understand a bustling city. You could look at it from space and see general patterns—where the downtown is, where the suburbs spread out. But to really understand the city, you need to meet its inhabitants: the barista brewing your morning coffee, the doctor saving lives at the hospital, the teacher shaping young minds. Each person has a unique story, a unique purpose.

Your body is that city, and cells are its inhabitants. For decades, scientists studied cells in bulk—like viewing that city from space. But what if we could listen to each cell individually? What if every cell could tell us its unique story through the genes it's expressing at that very moment?

Welcome to the world of single-cell RNA sequencing (scRNA-seq), where we do exactly that.

🎵 The Power of Listening to Individual Voices

Traditional bulk RNA sequencing is like recording a symphony orchestra but only having one microphone placed in the middle of the concert hall. You hear music, sure, but can you distinguish the violins from the cellos? The flutes from the trumpets?

Single-cell RNA-seq is like giving each musician their own microphone. Suddenly, you can:

✅ Identify who's playing what
✅ Discover that one amazing violinist playing a slightly different melody
✅ Realize the person you thought was a trumpeter is actually playing the saxophone

In our bodies, this means discovering:

Rare cell types that were invisible in bulk measurements
Cell states (like cells transitioning from resting to activated)
Disease mechanisms hidden within specific cell populations
How cells respond differently to the same treatment

🎯 Our Mission: Decoding Human Blood Cells

For this journey, we're analyzing peripheral blood mononuclear cells (PBMCs)—the immune cells floating in your bloodstream right now. These are your body's defense force: the soldiers, scouts, and command centers of your immune system.

We start with approximately 10,000 cells captured using 10x Genomics technology, which works like this:

Individual Cell → Droplet → Barcode → Sequence → Digital "Fingerprint"

Each cell gets trapped in its own microscopic droplet with a unique barcode, like giving each person a numbered ticket. When we sequence the RNA, we know exactly which cell each piece of genetic code came from.

📊 The Data We're Working With

📊 Dataset: 10,000+ individual PBMC cells
🧬 Genes tracked: ~20,000 per cell
📈 Total measurements: 200+ million data points
💾 File size: Several hundred megabytes

Think about that for a second: we're measuring the activity of 20,000 genes in 10,000 individual cells. That's like reading 200 million different sentences to understand the story of your immune system!

📓 Full Tutorial: The complete Jupyter notebook with all code and visualizations is available on GitHub. Follow along and run the analysis yourself!

Part 1: The Setup - Building Our Laboratory in the Cloud ☁️

Since we're working with massive datasets, we need a computational laboratory. We're using Google Colab, which is essentially a computer in the cloud that we can access from anywhere.

Why Google Drive Integration Matters

from google.colab import drive
drive.mount('/content/drive')

This simple act of "mounting" Google Drive is like claiming a permanent workspace. Without it, every time our coding session ends (Colab disconnects after a few hours of inactivity), we'd lose everything—installed packages, downloaded data, analysis results. It's like being a scientist whose lab vanishes every evening and reappears empty the next morning!

🔧 Our Toolkit: The Scientific Instruments

We're assembling quite the collection:

install_if_missing("scanpy")      # The Swiss Army knife of scRNA-seq
install_if_missing("anndata")     # Our data container
install_if_missing("igraph")      # For finding communities
install_if_missing("celltypist")  # AI-powered cell identification
install_if_missing("decoupler")   # Pattern recognition engine

Each library is like a specialized instrument:

Scanpy: Your all-in-one analysis platform (like a high-powered microscope)
AnnData: The filing system that keeps everything organized
igraph: A social network analyzer for cells (who hangs out with whom?)
Celltypist: An AI that learned what different cell types look like
Decoupler: A pattern matcher trained on thousands of experiments

scverse / scanpy

Single-cell analysis in Python. Scales to >100M cells.

Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing The Python-based implementation efficiently deals with datasets of more than one million cells. For datasets too large to fit into memory, many scanpy functions are now compatible with dask (warning: experimental).

Discuss usage on the scverse Discourse. Read the documentation. If you'd like to contribute by opening an issue or creating a pull request, please take a look at our contribution guide.

scanpy is part of the scverse® project (website, governance) and is fiscally sponsored by NumFOCUS If you like scverse® and want to support our mission, please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a variety…

View on GitHub

Part 2: Quality Control - Separating Signal from Noise 🔍

Not every cell we capture is high quality. Some are damaged during collection, some are actually two cells stuck together (doublets), and some are dying as we speak. We need to filter these out.

The Three Red Flags 🚩

1. Mitochondrial Genes (The Dying Cell Signature)

fresh_blood_adata.var['MT'] = fresh_blood_adata.var_names.str.startswith("MT-")

Why this matters: Imagine a sinking ship. The hull breaches, and water rushes in. The cargo floats away, but the heavy machinery stays on board. That's what happens to dying cells:

The cell membrane becomes leaky (the hull breaches)
Cytoplasmic RNA escapes (cargo floats away)
Mitochondrial RNA stays trapped inside (heavy machinery)

If more than 10-20% of a cell's RNA is mitochondrial, it's probably dying. We mark these cells for exclusion.

Healthy Cell:
🔵🔵🔵🔵🔵🔵🔵🔵🔵🔴 (10% MT = good)

Dying Cell:
🔵🔵🔴🔴🔴🔴🔴🔴🔴🔴 (80% MT = bad)

2. Ribosomal Genes (The House Cleaning Noise)

Ribosomal proteins are like the cleaning staff of the cell—essential but not particularly informative about what makes a T cell different from a B cell. Too much ribosomal RNA can drown out the interesting biological signals.

Why this matters: Imagine you're trying to identify different professions by listening to what people talk about:

Doctor: "surgery," "diagnosis," "patient care" ✓ (distinctive)
Teacher: "curriculum," "students," "lesson plans" ✓ (distinctive)
Everyone: "commute," "lunch," "email," "meetings" ✗ (not distinctive)

Ribosomal genes are the cellular equivalent of "everyone talks about this." They're housekeeping genes involved in protein synthesis—absolutely critical for cell survival, but expressed at high levels in ALL cells regardless of type.

The ribosomal noise problem:

Gene Expression Profile:

Biological Signal Genes:         Ribosomal Genes:
CD3D, CD4, CD8 (T cell markers)  RPL/RPS genes (protein machinery)

📊 PROBLEMATIC CELL (High Ribo):
Ribosomal: ████████████████████████████████ 70%
T markers:  ██████ 20%
Other:      ███ 10%
           ↑ Signal buried in noise!

📊 CLEAN CELL (Low Ribo):
T markers:  ████████████████████ 60%
Other:      ███████████ 30%
Ribosomal: ███ 10%
           ↑ Signal clearly visible!

Common ribosomal genes we filter:

RPL3, RPL4, RPL5... (Large ribosomal subunit proteins)
RPS2, RPS3, RPS4... (Small ribosomal subunit proteins)

These can represent 30-50% of total mRNA in some cells!

Visual comparison:

Cell Type Distinction WITHOUT Ribo Filtering:

     Expression Level
        │
   High │ ████ ████ ████  ← All cells look similar
        │ ████ ████ ████     (dominated by ribosomal signal)
    Low │ ░░░░ ░░░░ ░░░░
        └────────────────
         T    B    NK

Cell Type Distinction WITH Ribo Filtering:

     Expression Level
        │
   High │ ████           ← Clear differences!
        │ ░░░░ ████      (distinctive markers visible)
    Low │ ░░░░ ░░░░ ████
        └────────────────
         T    B    NK
        CD3  CD79 NKG7

The filtering strategy:

# Mark ribosomal genes
fresh_blood_adata.var['RIBO'] = fresh_blood_adata.var_names.str.startswith(('RPS', 'RPL'))

# Remove cells with >2% ribosomal content
adata = adata[adata.obs['pct_counts_RIBO'] < 2]

Why 2% threshold?

Distribution of Ribosomal RNA Content:

Cells│     Normal        Technical artifacts
     │       ╱╲          ╱╲
     │      ╱  ╲        ╱  ╲
     │     ╱    ╲______╱    ╲___
     │____╱                      ╲____
     0%   1%   2%   3%   4%   5%  →
          ↑
          2% cutoff
     Keep ✓   Remove ❌

Most healthy cells: 0.5-1.5% ribosomal RNA
Problematic cells: >2% (technical issues during capture)

3. Hemoglobin Genes (The Contamination Signal)

In blood samples, red blood cells can break open and spill their hemoglobin everywhere. It's like trying to study fish in an aquarium but the water is cloudy with algae. We need to filter this contamination.

Why this matters: Red blood cells (erythrocytes) are the most abundant cells in blood—about 5 million per microliter! But they're not immune cells and shouldn't be in our PBMC sample. When they lyse (break open) during sample processing, they release massive amounts of hemoglobin that can contaminate other cells.

The contamination cascade:

DURING SAMPLE PROCESSING:

Step 1: Blood draw
  🔴🔴🔴🔴🔴🔴🔴  ← Millions of red blood cells
  🔵 🟢 🟡        ← PBMCs (what we want)

Step 2: Density gradient separation
  ─────────────
  🔵 🟢 🟡        ← PBMCs float (good!)
  ─────────────
  🔴🔴🔴🔴🔴      ← RBCs sink (good!)

Step 3: Collection (sometimes incomplete)
  🔵 🟢 🟡 💥🔴  ← Some RBCs break open
               ↓
         Hemoglobin spills out!

Step 4: Hemoglobin contamination
  🔵💧 🟢💧 🟡   ← PBMCs coated with hemoglobin
     ↑
  HBA, HBB genes detected in immune cells (BAD!)

Hemoglobin gene expression pattern:

📊 CONTAMINATED CELL:
Hemoglobin: ██████████████████████ 50%  ← Red flag!
(HBA1, HBA2, HBB, HBD)
Immune:     ████ 10%
Other:      ████████ 40%

📊 CLEAN IMMUNE CELL:
Hemoglobin: ░ <0.1%  ← Should be near zero!
Immune:     ████████████████ 70%
Other:      ██████ 30%

Common hemoglobin genes:

HBA1  ─┐
HBA2  ─┤ Alpha globin chains
       │ (Should NOT be in lymphocytes!)
HBB   ─┤
HBD   ─┘ Beta globin chains

The detection pattern:

UMAP Plot - Before Hemoglobin Filtering:

        │  🔵🔵🔵          ← T cells (good)
        │  🔵🔵
        │
        │        🟢🟢      ← B cells (good)
        │        🟢🟢
        │
        │              ❌❌❌  ← Mystery cluster
        │              ❌❌❌     (HIGH HBA/HBB!)
        │              ❌❌❌
        └────────────────────

Gene Expression in Mystery Cluster:
╔═══════════════════════════════════╗
║ Gene      | % Expressed | Avg Exp ║
╠═══════════════════════════════════╣
║ HBA1      |    95%      |  9.2    ║  🚨
║ HBA2      |    92%      |  8.8    ║  🚨
║ HBB       |    89%      |  8.1    ║  🚨
║ CD3D      |     1%      |  0.1    ║  (no T markers)
║ CD79A     |     2%      |  0.1    ║  (no B markers)
║ CD14      |     3%      |  0.2    ║  (no myeloid)
╚═══════════════════════════════════╝

Conclusion: Red blood cell contamination or ambient RNA!

Ambient RNA problem:

What is "ambient RNA"?

During scRNA-seq:

  Droplet with healthy cell:
  ┌─────────────┐
  │  🔵         │ ← Immune cell
  │             │
  │  Clean!     │
  └─────────────┘

  Droplet with contamination:
  ┌─────────────┐
  │  🔵  💧💧   │ ← Cell + hemoglobin
  │      💧💧   │    floating in solution
  │  Problem!   │
  └─────────────┘

The hemoglobin RNA gets captured along with the cell's RNA!

Filtering strategy:

# Mark hemoglobin genes
fresh_blood_adata.var['HB'] = fresh_blood_adata.var_names.str.contains('^HB[^(P)]')

# Remove cells with >0.1% hemoglobin
adata = adata[adata.obs['pct_counts_HB'] < 0.1]

Why 0.1% threshold?

Hemoglobin Content Distribution:

Cells│     Clean cells    Contaminated
     │                    
     │ ╱╲                 
     │╱  ╲                ●  ●
     │    ╲___           ● ● ●
     │        ╲_________●______●___
     0%  0.05% 0.1%  0.5%  1%  5%  →
               ↑
            0.1% cutoff
        Keep ✓    Remove ❌

Clean immune cells: <0.05% hemoglobin
Contaminated cells: >0.1% hemoglobin
Pure RBC debris: >1% hemoglobin

Real-world impact:

CASE STUDY: The Phantom Cluster

Initial Analysis (no HB filtering):
  15 clusters detected
  Cluster 12: ~400 cells

Cluster 12 characteristics:
  ✗ No T cell markers
  ✗ No B cell markers  
  ✗ No NK markers
  ✗ No monocyte markers
  ✓ High HBA1, HBA2, HBB

Gene Ontology Enrichment:
  → "oxygen transport"
  → "hemoglobin complex"
  → "erythrocyte differentiation"

Conclusion: Not a real immune cell type!

After HB filtering:
  14 clusters detected ✓
  Cluster 12: REMOVED
  Clean analysis of actual immune cells!

The triple filter visualization:

Quality Control Cascade:

START: 10,000 cells
  │
  ├─ Filter MT > 20% ────→ Remove 800 dying cells
  │
  ├─ Filter RIBO > 2% ───→ Remove 500 noisy cells
  │
  ├─ Filter HB > 0.1% ───→ Remove 400 contaminated cells
  │
  ↓
END: 8,300 high-quality cells ✓

Cell Quality Score Formula:
Q = -w₁(MT%) - w₂(RIBO%) - w₃(HB%) + w₄(n_genes)

Where:
  w₁ = 2.0  (MT% most important - indicates death)
  w₂ = 1.0  (RIBO% moderate - indicates noise)
  w₃ = 1.5  (HB% high importance - indicates contamination)
  w₄ = 0.5  (n_genes - more genes = better)

Biological interpretation:

Why each filter matters for cell type identification:

WITHOUT FILTERS:                WITH FILTERS:

  "T cell" cluster               True T cell cluster
  Mixed signals:                 Clear signal:
  ├─ CD3D (real T marker)       ├─ CD3D ✓
  ├─ HBA1 (contamination)       ├─ CD3E ✓
  ├─ RPL3 (noise)               ├─ CD4 ✓
  └─ MT-CO1 (dying cells)       └─ IL7R ✓

  Confusing! ✗                  Perfect! ✓

The Quality Control Thresholds: Finding the Sweet Spot

After identifying these red flags, we need to decide where to draw the line. This is like airport security deciding how suspicious someone needs to be before additional screening.

# Set our quality control thresholds
fresh_blood_adata = fresh_blood_adata[
     (fresh_blood_adata.obs['pct_counts_MT'] < 20) &
     (fresh_blood_adata.obs['pct_counts_RIBO'] < 2) &
     (fresh_blood_adata.obs['pct_counts_HB'] < 0.1) &
     (fresh_blood_adata.obs['n_genes_by_counts'] > 200) &
     (fresh_blood_adata.obs['n_genes_by_counts'] < 6000)
]

Visual representation of our filtering criteria:

                         CELL QUALITY SPECTRUM

     ❌ TOO LOW                    ✅ JUST RIGHT                    ❌ TOO HIGH

     <200 genes              200-6000 genes detected           >6000 genes
     (empty/broken)          (healthy cells)                   (doublets)

     ┊                       ┊                                 ┊
     ▓░░░░░░░░░░░░░░░░░░░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░▓
     ┊                       ┊                                 ┊

     Mitochondrial RNA %:

     ❌ >20%                 ✅ <20%
     (dying cell)            (healthy cell)

     🔴🔴🔴🔴🔴🔴🔴🔴🔴🔴     🔵🔵🔵🔵🔵🔵🔵🔵🔴🔴
     (80% MT-bad)            (20% MT-good)

The Filtering Impact: Before and After

Here's what happens to our dataset:

📊 BEFORE FILTERING:
╔════════════════════════════════╗
║  Total cells: 10,000           ║
║  Mean genes/cell: 2,847        ║
║  Median MT%: 4.2%              ║
║  Suspected doublets: ~500      ║
║  Low quality: ~800             ║
╚════════════════════════════════╝

           ↓ APPLY FILTERS ↓

📊 AFTER FILTERING:
╔════════════════════════════════╗
║  Total cells: 8,200 ✓          ║
║  Mean genes/cell: 3,124 ✓      ║
║  Median MT%: 3.1% ✓            ║
║  Clean, high-quality data! 🎉  ║
╚════════════════════════════════╝

What we removed:

~500 doublets (two cells masquerading as one)
~800 low-quality/dying cells (high MT%, low gene counts)
~500 contaminated cells (high hemoglobin)

Why this matters: Garbage in, garbage out. If we include dying cells in our T cell cluster, we might falsely conclude that T cells express death-associated genes when they actually don't!

Interactive Visualization: The Filtering Dashboard

Imagine looking at a scatter plot where each point is a cell:

          Mitochondrial % vs Genes Detected

     MT%
     30│                           ❌ ← dying cells
       │                    ❌  ❌
     25│              ❌ ❌     ❌
       │          ❌          ❌
     20├─────────────────────────────── Filter threshold
       │    ✓✓✓✓✓✓✓✓✓✓✓
     15│  ✓✓✓✓✓✓✓✓✓✓✓✓✓
       │ ✓✓✓✓✓✓✓✓✓✓✓✓✓✓     ← healthy cells
     10│✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓
       │✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓
      5│✓✓✓✓✓✓✓✓✓✓✓✓✓✓
       │✓✓✓✓✓✓✓✓✓✓✓✓✓
      0└─────────────────────────────→ Genes detected
       0   1k  2k  3k  4k  5k  6k  7k
                                           ↑
                                   doublets ❌

The beauty of this scatter plot:

Bottom left: Empty droplets/debris (low genes, low MT)
Middle band: Healthy cells (moderate genes, low MT) ✓
Top right: Dying cells (variable genes, high MT)
Far right: Doublets (too many genes)

Gene-Level Filtering: Removing Uninformative Genes

We also filter genes, not just cells:

# Remove genes detected in fewer than 3 cells
sc.pp.filter_genes(fresh_blood_adata, min_cells=3)

Why remove rare genes?

Gene X detected in 2 cells out of 8,200:

Scenario 1: Technical noise
  Cell_0042: "X" ← Sequencing error
  Cell_5891: "X" ← Random spurious detection

Scenario 2: Extremely rare (can't analyze statistically)
  Even if real, 2 cells is insufficient for robust analysis

The filtering cascade:

          GENE FILTERING CASCADE

Start: 20,000 genes
  ↓
Remove genes in <3 cells
  ↓ (remove ~5,000 rare/noise genes)
  ↓
15,000 genes remain
  ↓
Keep highly variable genes (later step)
  ↓ (select ~2,000-3,000 informative genes)
  ↓
Final: 2,000 genes for analysis

This is like going from every word in the dictionary
to just the keywords that define each cell type!

The Scrublet Score Distribution

When we run doublet detection, we get a score for each cell:

     Doublet Score Distribution

Cells│     Real cells        Doublets
      │        ╱╲               ╱╲
      │       ╱  ╲             ╱  ╲
      │      ╱    ╲___________╱    ╲
      │     ╱                       ╲
      │____╱_________________________╲___
      0.0  0.1  0.2  0.3  0.4  0.5  0.6
                 Doublet Score →

      ├─────────┤├──────────────────┤
       Keep ✓    Remove (threshold) ❌

Reading this plot:

Left peak: Single cells (low doublet scores) ✓
Right peak/tail: Likely doublets (high scores) ❌
Threshold (typically 0.25-0.30): Decision boundary

The Complete Quality Control Workflow Visualized

                         QC WORKFLOW

     📥 Raw Data
     10,000 cells
           │
           ├──→ [Calculate QC metrics]
           │    • Count genes per cell
           │    • Calculate MT%
           │    • Calculate RIBO%
           │    • Calculate HB%
           │
           ├──→ [Detect doublets]
           │    • Run Scrublet
           │    • Flag suspicious cells
           │
           ├──→ [Visualize distributions]
           │    • Violin plots
           │    • Scatter plots
           │    • Histograms
           │
           ├──→ [Set thresholds]
           │    • MT% < 20%
           │    • Genes: 200-6000
           │    • RIBO% < 2%
           │    • HB% < 0.1%
           │    • Doublet score < 0.25
           │
           ├──→ [Apply filters]
           │    • Remove 1,800 cells
           │
           ↓
     ✅ Clean Data
     8,200 high-quality cells
           │
           ↓
     [Continue to normalization...]

Real Example: The Contamination Story

Here's a real scenario from our analysis:

CLUSTER 7: The Mystery Cluster

Initial observation:
  • 450 cells
  • High hemoglobin genes (HBA1, HBA2, HBB)
  • Low immune markers
  • Suspiciously uniform

Investigation:
  ╔═══════════════════════════════════╗
  ║ Gene      | % Expressed | Avg Exp ║
  ╠═══════════════════════════════════╣
  ║ HBA1      |    92%      |  8.7    ║  ← Red flag!
  ║ HBA2      |    89%      |  8.2    ║  ← Red flag!
  ║ HBB       |    87%      |  7.9    ║  ← Red flag!
  ║ CD3D      |     2%      |  0.1    ║  ← Should be higher
  ║ CD79A     |     1%      |  0.0    ║  ← Should be higher
  ╚═══════════════════════════════════╝

Conclusion: Red blood cell contamination!

Action: Filter out this cluster entirely

Result: Clean immune cell analysis ✓

The Mathematical Beauty of QC

Quality control involves elegant statistical reasoning:

For each cell, we calculate a quality score:

Q = w₁(genes_normalized) - w₂(MT_pct) - w₃(doublet_score)

Where:
  genes_normalized = (genes - min) / (max - min)
  MT_pct = mitochondrial percentage
  doublet_score = Scrublet output
  w₁, w₂, w₃ = weights (importance of each factor)

High Q score → Keep cell ✓
Low Q score → Remove cell ❌

Visual representation:

Quality Score Distribution:

      │        Keep these →  ╱╲
Cells│                     ╱  ╲
      │                    ╱    ╲___
      │    Remove these → ╱          ╲
      │___________________╱____________╲____
      Low                              High
                 Quality Score →

📈 Visualizing Quality Metrics

Here's what our quality control plots reveal:

        Number of Genes Detected
             🎻
        │    ╱ ╲
        │   ╱   ╲
   Cells│  ╱     ╲___
        │ ╱          ╲___
        └──────────────────────
           0  2000  4000  6000

This violin plot shows:

✅ Most cells: 2,000-4,000 genes detected
❌ Low-quality cells: <1,000 genes detected
⚠️ Possible doublets: >6,000 genes detected

🎭 Doublet Detection: Finding Cellular Impostors

sc.pp.scrublet(fresh_blood_adata)

Doublets are when two cells end up in the same droplet. It's like conducting interviews but accidentally interviewing two people at once—their combined answers make no sense!

How Scrublet works:

Simulates artificial doublets by merging random pairs of real cells
Compares each real cell to these fake doublets
Flags cells that look suspiciously similar to the fakes

Cell A + Cell B = Doublet
  🔵    +   🟢   =   🔵🟢
 (T cell)  (B cell)  (Fake "hybrid" cell)

Part 3: Normalization - Making Apples Comparable to Apples 🍎

Different cells captured different amounts of RNA—not because they're biologically different, but due to random technical variation. Some droplets caught "talkative" cells, others caught "quiet" ones.

The problem: Cell A might have 50,000 RNA molecules, Cell B only 5,000. Does this mean Cell A is more active? Not necessarily! Cell A might have just been captured more efficiently.

The solution: Normalize!

# Scale each cell to the same total count
sc.pp.normalize_total(fresh_blood_adata)
# Log-transform to stabilize variance
sc.pp.log1p(fresh_blood_adata)

This is like adjusting volume levels so every musician can be heard equally. Now we can compare their actual melodies, not just their loudness.

Why Log Transformation?

Gene expression follows a power law distribution—a few genes are massively expressed, most are moderate or low. Log transformation compresses this range so we can see patterns in both highly and lowly expressed genes.

Before log:
Gene A: ▏
Gene B: ▍
Gene C: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓  (drowns out everything!)

After log:
Gene A: ▍
Gene B: ▍▍
Gene C: ▍▍▍▍  (all visible!)

Part 4: Feature Selection - Finding the Genes That Matter 🎯

Out of 20,000 genes, which ones actually help us distinguish cell types?

Most genes are "housekeeping genes"—genes that every cell needs to survive, like the genes for making ribosomes or processing energy. These genes are like everyone in the city wearing pants—sure, it's universal, but it doesn't tell you whether someone is a doctor, teacher, or artist.

We want the distinctive genes—the ones that make cell types unique:

CD3D, CD3E → T cell markers (like a doctor's stethoscope)
CD79A, CD79B → B cell markers (like a teacher's chalk)
NKG7, GNLY → NK cell markers (like a soldier's uniform)

sc.pp.highly_variable_genes(fresh_blood_adata, n_top_genes=1000)

This selects the top 1,000 most variable genes—genes that change dramatically between cells, indicating biological differences rather than random noise.

📊 Highly Variable Genes Plot:

   Dispersion
      ↑
   ●  │  ●  Selected genes
   ●  │ ●● ●  (high variance)
 ●    │ ● ●
  ● ● │
────●─┼────────→ Mean Expression
      │ ○ ○ ○  Not selected
      │○  ○    (low variance)

Part 5: Dimensionality Reduction - From 1,000 Dimensions to 2D 📐

We now have 1,000 genes per cell. That's 1,000 dimensions—impossible to visualize! Imagine trying to draw a 1,000-dimensional graph. Your brain would melt.

Enter PCA (Principal Component Analysis)

PCA is like finding the best camera angles to photograph a complex 3D sculpture. Instead of taking 1,000 photos from random angles, we find the 30-50 angles that show the most interesting features.

sc.tl.pca(fresh_blood_adata)

How PCA works (simplified):

Find the direction with the most variation (PC1)
Find the next direction with the most variation perpendicular to PC1 (PC2)
Continue for 50 components

Original 1,000 dimensions:
🎯🎯🎯🎯🎯🎯🎯🎯🎯🎯... (×100)

Compressed to 50 PCs:
🎯🎯🎯🎯🎯 (captures 80%+ of the information!)

UMAP: The Final Visualization

PCA gives us 50 dimensions—better, but still not visualizable. Enter UMAP (Uniform Manifold Approximation and Projection):

sc.tl.umap(fresh_blood_adata)

UMAP is like taking a complex network map of a city with all its roads, buildings, and connections, and drawing a simplified 2D map that preserves the important relationships:

Neighborhoods stay together
Connected areas remain connected
Distances approximately preserved

         UMAP Plot
    ╭────────────────╮
    │  🔵🔵🔵         │
    │  🔵🔵           │
    │                │
    │        🟢🟢🟢  │
    │        🟢🟢    │
    │                │
    │ 🔴🔴🔴          │
    │ 🔴🔴            │
    ╰────────────────╯

🔵 = T cells
🟢 = B cells
🔴 = NK cells

Part 6: Clustering - Discovering Cell Types 🔬

Now we can see clear islands of cells in our UMAP plot. But which island is which cell type?

The Leiden Algorithm: Finding Communities

sc.tl.leiden(fresh_blood_adata, resolution=0.5)

Think of your cells as people at a conference. The Leiden algorithm is like observing who talks to whom:

People cluster into groups (research fields)
Similar people group together naturally
Some people bridge multiple groups

How it works:

Build a network where similar cells are connected
Find communities where connections are dense internally but sparse externally
Iteratively optimize these communities

Network Graph:

T cells:        B cells:
 ○─○─○          ○─○
 │ │ │          │ │
 ○─○─○          ○─○
   ↓ (weak connection)
NK cells:
 ○─○─○
 │ │ │
 ○─○─○

🔍 Resolution: Choosing the Right Magnification

The resolution parameter is like choosing how close to zoom in:

# Low resolution (0.02): See major cell lineages
sc.tl.leiden(adata, resolution=0.02)  
# → 3-5 clusters (T cells, B cells, Myeloid)

# Medium resolution (0.5): See cell types
sc.tl.leiden(adata, resolution=0.5)   
# → 10-15 clusters (CD4+ T, CD8+ T, Memory B, Naive B...)

# High resolution (2.0): See cell subtypes
sc.tl.leiden(adata, resolution=2.0)   
# → 30+ clusters (Th1, Th2, Th17, Tregs...)

It's like looking at a forest:

🌲 Far away: You see the forest (low resolution)
🌳 Closer: You see individual tree species (medium resolution)
🍃 Very close: You see oak subspecies and age classes (high resolution)

Part 7: Cell Type Annotation - Putting Names to Faces 🏷️

We have clusters, but what ARE they? This is where Decoupler comes in.

Decoupler + PanglaoDB: The Cell Type Detective

markers = dc.op.resource(name="PanglaoDB", organism="human")
dc.mt.ulm(data=fresh_blood_adata, net=markers, tmin=3)

PanglaoDB is a database containing thousands of known cell type markers from published studies. It's like having a library of "wanted posters" for different cell types:

╔═══════════════════════════╗
║      B CELL MARKERS       ║
║  - CD19  - CD79A          ║
║  - CD79B - MS4A1          ║
╚═══════════════════════════╝

╔═══════════════════════════╗
║      T CELL MARKERS       ║
║  - CD3D  - CD3E           ║
║  - CD3G  - CD4            ║
╚═══════════════════════════╝

How Decoupler annotates cells:

For each cell, check how many B cell markers it expresses
Check how many T cell markers it expresses
Check all other cell types
Assign the cell to the type with the highest score

Cell X:
  B cell score:  0.2 ⭐
  T cell score:  8.7 ⭐⭐⭐⭐⭐⭐⭐⭐
  NK cell score: 0.1 ⭐

→ Cell X is a T CELL! 🎯

Statistical Ranking

We go one step further by statistically testing which cell type signature is most enriched in each cluster:

fresh_blood_adata_rank = dc.tl.rankby_group(
    score, 
    groupby="leiden_res0_02", 
    reference="rest", 
    method="t-test_overestim_var"
)

This is like asking: "For each cluster, which cell type signature is significantly higher here than everywhere else?"

Part 8: Validation - Trust, But Verify ✅

Automated annotation is powerful, but we still need to validate with known marker genes. Let's visualize!

The Dot Plot: Expression at a Glance

sc.pl.dotplot(fresh_blood_adata, marker_genes_dict, "cell_type")

A dot plot shows two things:

Dot size: What percentage of cells express this gene?
Dot color: How strongly is it expressed?

             CD3D  CD79A  NKG7
T cells       ●●●    ·     ·
B cells        ·    ●●●    ·
NK cells       ·     ·    ●●●

● = large, red dot (high expression)
· = tiny dot (low/no expression)

If our annotation is correct, we should see:

✅ T cell markers HIGH in T cells, LOW elsewhere
✅ B cell markers HIGH in B cells, LOW elsewhere
✅ NK cell markers HIGH in NK cells, LOW elsewhere

The Violin Plot: Distribution Stories

sc.pl.violin(score, keys=["B cells memory"], groupby="leiden_res0_02")

Violin plots show the full distribution of expression levels:

Cluster 0:  🎻 (wide, high median → B cells!)
Cluster 1:  🎵 (narrow, low → not B cells)
Cluster 2:  🎵 (narrow, low → not B cells)

The Heatmap: The Big Picture

sc.pl.heatmap(fresh_blood_adata, marker_genes_dict, groupby="leiden_res0_02")

A heatmap shows everything at once:

           CD3D CD3E CD79A MS4A1 NKG7 GNLY
Cluster 0: 🟦🟦 🟨🟨 🟦🟦   (T cells)
Cluster 1: 🟦🟦 🟨🟨 🟦🟦   (B cells)
Cluster 2: 🟦🟦 🟦🟦 🟨🟨   (NK cells)

🟨 = high expression
🟦 = low expression

Each row should show a distinct pattern, like a unique barcode for each cell type.

🧬 The Biology Behind the Code: What We Discovered

After all this analysis, what did we actually learn about these blood cells?

The Major Players in Your Immune System

🔵 T Cells (CD3+): Your adaptive immune system's quarterbacks

Coordinate immune responses
Kill infected cells
Remember past infections
Represent ~70% of lymphocytes in blood

🟢 B Cells (CD19+, CD79A+): Your antibody factories

Produce immunoglobulins
Create immunological memory
Present antigens to T cells
~10-15% of lymphocytes

🟡 NK Cells (CD56+, NKG7+): Your innate immune system's assassins

Kill virus-infected and tumor cells
Don't need prior exposure
Rapid response team
~5-15% of lymphocytes

🔴 Monocytes (CD14+, CD16+): Your mobile macrophages

Engulf pathogens and debris
Differentiate into macrophages and dendritic cells
Present antigens
~2-10% of white blood cells

Subtypes and Complexity

Within each major type, we find subtypes:

T Cells:

CD4+ Helper T cells (orchestrate immune response)
CD8+ Killer T cells (destroy infected cells)
Regulatory T cells (prevent autoimmunity)
Memory T cells (remember past infections)

B Cells:

Naive B cells (haven't encountered antigen yet)
Memory B cells (remember past infections)
Plasma cells (antibody-producing factories)

This heterogeneity is exactly WHY single-cell analysis is revolutionary—we'd miss all this complexity in bulk sequencing!

💻 The Code That Makes It Happen

Let me show you the elegant simplicity of the core workflow:

# 1. Load and QC
adata = sc.read_10x_h5(filename)
sc.pp.calculate_qc_metrics(adata, qc_vars=["MT", "RIBO", "HB"], inplace=True)
sc.pp.filter_cells(adata, min_genes=1000)
sc.pp.filter_genes(adata, min_cells=1000)

# 2. Normalize
adata.layers["counts"] = adata.X.copy()  # Save raw counts
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# 3. Feature selection
sc.pp.highly_variable_genes(adata, n_top_genes=1000)

# 4. Dimensionality reduction
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# 5. Clustering
sc.tl.leiden(adata, resolution=0.5)

# 6. Cell type annotation
markers = dc.op.resource(name="PanglaoDB", organism="human")
dc.mt.ulm(data=adata, net=markers)

# 7. Visualize
sc.pl.umap(adata, color=["leiden", "cell_type"])
sc.pl.dotplot(adata, marker_genes, "cell_type")

That's it! About 15 lines of code to go from raw data to annotated cell types. Of course, the devil is in the details—choosing the right parameters, validating results, interpreting biology—but the core workflow is remarkably elegant.

📓 Get the Full Code: Check out the complete Jupyter notebook on GitHub with step-by-step explanations and all visualizations!

🌟 Real-World Applications: Why This Matters

1. Cancer Immunotherapy 🎗️

Single-cell analysis revealed why some patients respond to immunotherapy and others don't:

Responders have more "exhausted" T cells that can be reactivated
Non-responders lack these cells or have immunosuppressive cells
Led to combination therapies targeting specific cell states

2. COVID-19 Severity 🦠

During the pandemic, scRNA-seq showed:

Severe cases had dysregulated interferon responses
Certain monocyte subtypes predicted worse outcomes
Helped identify therapeutic targets

3. Autoimmune Diseases 🔄

Discovered that:

Regulatory T cells are dysfunctional, not absent, in many autoimmune conditions
Different autoimmune diseases have distinct cellular signatures
Opened new avenues for targeted therapy

4. Developmental Biology 👶

Mapping how embryonic stem cells become specialized:

Identified intermediate cell states
Revealed decision points in cell fate
Guided strategies for lab-grown tissues

🚀 The Future: Where Are We Headed?

Multi-Omics Integration

Imagine measuring not just RNA, but also:

Proteins (what's actually doing the work)
Chromatin accessibility (which genes CAN be turned on)
DNA methylation (epigenetic memory)
TCR/BCR sequences (immune receptor repertoire)

All in the SAME cell. This is happening now!

Spatial Transcriptomics

The next frontier: measuring gene expression while preserving WHERE each cell sits in the tissue:

        Traditional scRNA-seq          Spatial

             ●●●●●●                   🔵🔵🟢🔵🔵
             ●●●●●●                   🔵🔵🟢🔵🔵
        (lost spatial info)          🟢🟢🟢🟢🟢
                                     🔴🔴🔴🔴🔴
                                (preserved positions!)

This reveals:

Cell-cell communication networks
Tissue architecture
Microenvironments
Gradients and zones

AI-Powered Analysis

Machine learning models are being trained on millions of cells to:

Automatically annotate any cell from any tissue
Predict drug responses
Identify disease signatures
Generate hypotheses

Real-Time Clinical Diagnostics

The dream: rapid scRNA-seq in hospitals:

Patient presents with immune disorder
Draw blood
scRNA-seq analysis in 24 hours
Precise diagnosis and personalized treatment plan

🤔 The Philosophy: Why Small Things Matter

There's something profound about single-cell analysis. For centuries, we studied populations, averages, bulks. We asked "What does THE cell do?" as if there was one representative of each type.

Single-cell analysis taught us humility. There is no "THE cell." Every cell is an individual with its own history, state, and destiny. The T cell next to its sibling might be:

Responding to infection while its neighbor rests
Expressing different genes despite identical DNA
On a path to memory while its twin approaches exhaustion

This is true for cells, and perhaps, it's true for life. We are not averages. We are distributions. We are heterogeneous. And that heterogeneity is not noise—it IS the biology.

When we measure 10,000 cells individually, we're not just generating data. We're acknowledging that:

Rare cells matter (the 0.1% might save your life)
Transitions matter (cells in-between states reveal mechanisms)
Variation matters (diversity enables adaptation)

🎓 Getting Started: Your Own Single-Cell Journey

Want to try this yourself? Here's your roadmap:

1. Learn Python Basics (2-4 weeks)

Variables, loops, functions
Pandas for data manipulation
Matplotlib for visualization

2. Understand the Biology (ongoing)

Cell biology fundamentals
Immunology basics (if analyzing immune cells)
Read reviews on scRNA-seq

3. Work Through Tutorials (4-8 weeks)

Scanpy tutorials
Seurat tutorials (R alternative)
Published workflows

4. Analyze Public Data (2-3 months)

10x Genomics public datasets
Human Cell Atlas
GEO database

5. Design Your Own Experiment (∞)

Define biological question
Plan experimental design
Collaborate with experts

📚 Resources

Essential Reading:

"Current best practices in single-cell RNA-seq analysis" (Luecken & Theis, 2019)
Scanpy documentation
Single Cell Best Practices online book

Datasets to Practice:

Communities:

Biostars forum
r/bioinformatics
Single-cell analysis Slack/Discord channels

🎼 Final Thoughts: The Symphony of Life

When I first looked at a UMAP plot of 10,000 cells, each dot representing an individual cell, I was struck by the beauty of it. These aren't just data points. Each one is a living entity with:

Billions of proteins working in concert
Thousands of molecular machines running simultaneously
A complete copy of your genome deciding what to become
A past (where it came from) and future (what it will become)

And we can listen to each one's unique molecular signature—the genes it chooses to express at this moment in time.

It's like conducting a 10,000-piece orchestra where:

Every musician plays their own score
The music changes moment by moment
Patterns emerge from seeming chaos
And somehow, together, they create the symphony of life

That's what single-cell RNA sequencing lets us do: listen to the individual voices while appreciating the collective harmony.

Your immune system—right now, as you read this—contains billions of these cells. Each one performing its role, some fighting infections you'll never know you had, some remembering that cold from three years ago, some just quietly waiting for their moment to act.

We're just beginning to learn their stories. And with every cell we sequence, every algorithm we refine, every biological insight we gain, we get a little closer to understanding the fundamental question:

What does it mean to be alive, one cell at a time?