DEV Community

Cover image for Decoding Life One Cell at a Time: A Journey Through Single-Cell RNA Sequencing
Awosise Oluwaseun
Awosise Oluwaseun

Posted on

Decoding Life One Cell at a Time: A Journey Through Single-Cell RNA Sequencing

When 10,000 Cells Tell 10,000 Different Stories

Imagine you're trying to understand a bustling city. You could look at it from space and see general patternsโ€”where the downtown is, where the suburbs spread out. But to really understand the city, you need to meet its inhabitants: the barista brewing your morning coffee, the doctor saving lives at the hospital, the teacher shaping young minds. Each person has a unique story, a unique purpose.

Your body is that city, and cells are its inhabitants. For decades, scientists studied cells in bulkโ€”like viewing that city from space. But what if we could listen to each cell individually? What if every cell could tell us its unique story through the genes it's expressing at that very moment?

Welcome to the world of single-cell RNA sequencing (scRNA-seq), where we do exactly that.


๐ŸŽต The Power of Listening to Individual Voices

Traditional bulk RNA sequencing is like recording a symphony orchestra but only having one microphone placed in the middle of the concert hall. You hear music, sure, but can you distinguish the violins from the cellos? The flutes from the trumpets?

Single-cell RNA-seq is like giving each musician their own microphone. Suddenly, you can:

  • โœ… Identify who's playing what
  • โœ… Discover that one amazing violinist playing a slightly different melody
  • โœ… Realize the person you thought was a trumpeter is actually playing the saxophone

In our bodies, this means discovering:

  • Rare cell types that were invisible in bulk measurements
  • Cell states (like cells transitioning from resting to activated)
  • Disease mechanisms hidden within specific cell populations
  • How cells respond differently to the same treatment

๐ŸŽฏ Our Mission: Decoding Human Blood Cells

For this journey, we're analyzing peripheral blood mononuclear cells (PBMCs)โ€”the immune cells floating in your bloodstream right now. These are your body's defense force: the soldiers, scouts, and command centers of your immune system.

We start with approximately 10,000 cells captured using 10x Genomics technology, which works like this:

Individual Cell โ†’ Droplet โ†’ Barcode โ†’ Sequence โ†’ Digital "Fingerprint"
Enter fullscreen mode Exit fullscreen mode

Each cell gets trapped in its own microscopic droplet with a unique barcode, like giving each person a numbered ticket. When we sequence the RNA, we know exactly which cell each piece of genetic code came from.

๐Ÿ“Š The Data We're Working With

๐Ÿ“Š Dataset: 10,000+ individual PBMC cells
๐Ÿงฌ Genes tracked: ~20,000 per cell
๐Ÿ“ˆ Total measurements: 200+ million data points
๐Ÿ’พ File size: Several hundred megabytes
Enter fullscreen mode Exit fullscreen mode

Think about that for a second: we're measuring the activity of 20,000 genes in 10,000 individual cells. That's like reading 200 million different sentences to understand the story of your immune system!

๐Ÿ““ Full Tutorial: The complete Jupyter notebook with all code and visualizations is available on GitHub. Follow along and run the analysis yourself!


Part 1: The Setup - Building Our Laboratory in the Cloud โ˜๏ธ

Since we're working with massive datasets, we need a computational laboratory. We're using Google Colab, which is essentially a computer in the cloud that we can access from anywhere.

Why Google Drive Integration Matters

from google.colab import drive
drive.mount('/content/drive')
Enter fullscreen mode Exit fullscreen mode

This simple act of "mounting" Google Drive is like claiming a permanent workspace. Without it, every time our coding session ends (Colab disconnects after a few hours of inactivity), we'd lose everythingโ€”installed packages, downloaded data, analysis results. It's like being a scientist whose lab vanishes every evening and reappears empty the next morning!

๐Ÿ”ง Our Toolkit: The Scientific Instruments

We're assembling quite the collection:

install_if_missing("scanpy")      # The Swiss Army knife of scRNA-seq
install_if_missing("anndata")     # Our data container
install_if_missing("igraph")      # For finding communities
install_if_missing("celltypist")  # AI-powered cell identification
install_if_missing("decoupler")   # Pattern recognition engine
Enter fullscreen mode Exit fullscreen mode

Each library is like a specialized instrument:

  • Scanpy: Your all-in-one analysis platform (like a high-powered microscope)
  • AnnData: The filing system that keeps everything organized
  • igraph: A social network analyzer for cells (who hangs out with whom?)
  • Celltypist: An AI that learned what different cell types look like
  • Decoupler: A pattern matcher trained on thousands of experiments

GitHub logo scverse / scanpy

Single-cell analysis in Python. Scales to >100M cells.

Stars PyPI PyPI Downloads Conda Forge Conda Forge Downloads Docs CI Discourse topics Chat Powered by NumFOCUS

Scanpy โ€“ Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing The Python-based implementation efficiently deals with datasets of more than one million cells. For datasets too large to fit into memory, many scanpy functions are now compatible with dask (warning: experimental).

Discuss usage on the scverse Discourse. Read the documentation. If you'd like to contribute by opening an issue or creating a pull request, please take a look at our contribution guide.

scanpy is part of the scverseยฎ project (website, governance) and is fiscally sponsored by NumFOCUS If you like scverseยฎ and want to support our mission, please consider making a tax-deductible donation to help the project pay for developer time, professional services, travel, workshops, and a varietyโ€ฆ


Part 2: Quality Control - Separating Signal from Noise ๐Ÿ”

Not every cell we capture is high quality. Some are damaged during collection, some are actually two cells stuck together (doublets), and some are dying as we speak. We need to filter these out.

The Three Red Flags ๐Ÿšฉ

1. Mitochondrial Genes (The Dying Cell Signature)

fresh_blood_adata.var['MT'] = fresh_blood_adata.var_names.str.startswith("MT-")
Enter fullscreen mode Exit fullscreen mode

Why this matters: Imagine a sinking ship. The hull breaches, and water rushes in. The cargo floats away, but the heavy machinery stays on board. That's what happens to dying cells:

  • The cell membrane becomes leaky (the hull breaches)
  • Cytoplasmic RNA escapes (cargo floats away)
  • Mitochondrial RNA stays trapped inside (heavy machinery)

If more than 10-20% of a cell's RNA is mitochondrial, it's probably dying. We mark these cells for exclusion.

Healthy Cell:
๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ด (10% MT = good)

Dying Cell:
๐Ÿ”ต๐Ÿ”ต๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด (80% MT = bad)
Enter fullscreen mode Exit fullscreen mode

2. Ribosomal Genes (The House Cleaning Noise)

Ribosomal proteins are like the cleaning staff of the cellโ€”essential but not particularly informative about what makes a T cell different from a B cell. Too much ribosomal RNA can drown out the interesting biological signals.

Why this matters: Imagine you're trying to identify different professions by listening to what people talk about:

  • Doctor: "surgery," "diagnosis," "patient care" โœ“ (distinctive)
  • Teacher: "curriculum," "students," "lesson plans" โœ“ (distinctive)
  • Everyone: "commute," "lunch," "email," "meetings" โœ— (not distinctive)

Ribosomal genes are the cellular equivalent of "everyone talks about this." They're housekeeping genes involved in protein synthesisโ€”absolutely critical for cell survival, but expressed at high levels in ALL cells regardless of type.

The ribosomal noise problem:

Gene Expression Profile:

Biological Signal Genes:         Ribosomal Genes:
CD3D, CD4, CD8 (T cell markers)  RPL/RPS genes (protein machinery)

๐Ÿ“Š PROBLEMATIC CELL (High Ribo):
Ribosomal: โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 70%
T markers:  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 20%
Other:      โ–ˆโ–ˆโ–ˆ 10%
           โ†‘ Signal buried in noise!

๐Ÿ“Š CLEAN CELL (Low Ribo):
T markers:  โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 60%
Other:      โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 30%
Ribosomal: โ–ˆโ–ˆโ–ˆ 10%
           โ†‘ Signal clearly visible!
Enter fullscreen mode Exit fullscreen mode

Common ribosomal genes we filter:

RPL3, RPL4, RPL5... (Large ribosomal subunit proteins)
RPS2, RPS3, RPS4... (Small ribosomal subunit proteins)

These can represent 30-50% of total mRNA in some cells!
Enter fullscreen mode Exit fullscreen mode

Visual comparison:

Cell Type Distinction WITHOUT Ribo Filtering:

     Expression Level
        โ”‚
   High โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ  โ† All cells look similar
        โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ โ–ˆโ–ˆโ–ˆโ–ˆ     (dominated by ribosomal signal)
    Low โ”‚ โ–‘โ–‘โ–‘โ–‘ โ–‘โ–‘โ–‘โ–‘ โ–‘โ–‘โ–‘โ–‘
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
         T    B    NK

Cell Type Distinction WITH Ribo Filtering:

     Expression Level
        โ”‚
   High โ”‚ โ–ˆโ–ˆโ–ˆโ–ˆ           โ† Clear differences!
        โ”‚ โ–‘โ–‘โ–‘โ–‘ โ–ˆโ–ˆโ–ˆโ–ˆ      (distinctive markers visible)
    Low โ”‚ โ–‘โ–‘โ–‘โ–‘ โ–‘โ–‘โ–‘โ–‘ โ–ˆโ–ˆโ–ˆโ–ˆ
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
         T    B    NK
        CD3  CD79 NKG7
Enter fullscreen mode Exit fullscreen mode

The filtering strategy:

# Mark ribosomal genes
fresh_blood_adata.var['RIBO'] = fresh_blood_adata.var_names.str.startswith(('RPS', 'RPL'))

# Remove cells with >2% ribosomal content
adata = adata[adata.obs['pct_counts_RIBO'] < 2]
Enter fullscreen mode Exit fullscreen mode

Why 2% threshold?

Distribution of Ribosomal RNA Content:

Cellsโ”‚     Normal        Technical artifacts
     โ”‚       โ•ฑโ•ฒ          โ•ฑโ•ฒ
     โ”‚      โ•ฑ  โ•ฒ        โ•ฑ  โ•ฒ
     โ”‚     โ•ฑ    โ•ฒ______โ•ฑ    โ•ฒ___
     โ”‚____โ•ฑ                      โ•ฒ____
     0%   1%   2%   3%   4%   5%  โ†’
          โ†‘
          2% cutoff
     Keep โœ“   Remove โŒ

Most healthy cells: 0.5-1.5% ribosomal RNA
Problematic cells: >2% (technical issues during capture)
Enter fullscreen mode Exit fullscreen mode

3. Hemoglobin Genes (The Contamination Signal)

In blood samples, red blood cells can break open and spill their hemoglobin everywhere. It's like trying to study fish in an aquarium but the water is cloudy with algae. We need to filter this contamination.

Why this matters: Red blood cells (erythrocytes) are the most abundant cells in bloodโ€”about 5 million per microliter! But they're not immune cells and shouldn't be in our PBMC sample. When they lyse (break open) during sample processing, they release massive amounts of hemoglobin that can contaminate other cells.

The contamination cascade:

DURING SAMPLE PROCESSING:

Step 1: Blood draw
  ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด  โ† Millions of red blood cells
  ๐Ÿ”ต ๐ŸŸข ๐ŸŸก        โ† PBMCs (what we want)

Step 2: Density gradient separation
  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  ๐Ÿ”ต ๐ŸŸข ๐ŸŸก        โ† PBMCs float (good!)
  โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
  ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด      โ† RBCs sink (good!)

Step 3: Collection (sometimes incomplete)
  ๐Ÿ”ต ๐ŸŸข ๐ŸŸก ๐Ÿ’ฅ๐Ÿ”ด  โ† Some RBCs break open
               โ†“
         Hemoglobin spills out!

Step 4: Hemoglobin contamination
  ๐Ÿ”ต๐Ÿ’ง ๐ŸŸข๐Ÿ’ง ๐ŸŸก   โ† PBMCs coated with hemoglobin
     โ†‘
  HBA, HBB genes detected in immune cells (BAD!)
Enter fullscreen mode Exit fullscreen mode

Hemoglobin gene expression pattern:

๐Ÿ“Š CONTAMINATED CELL:
Hemoglobin: โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 50%  โ† Red flag!
(HBA1, HBA2, HBB, HBD)
Immune:     โ–ˆโ–ˆโ–ˆโ–ˆ 10%
Other:      โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 40%

๐Ÿ“Š CLEAN IMMUNE CELL:
Hemoglobin: โ–‘ <0.1%  โ† Should be near zero!
Immune:     โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 70%
Other:      โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ 30%
Enter fullscreen mode Exit fullscreen mode

Common hemoglobin genes:

HBA1  โ”€โ”
HBA2  โ”€โ”ค Alpha globin chains
       โ”‚ (Should NOT be in lymphocytes!)
HBB   โ”€โ”ค
HBD   โ”€โ”˜ Beta globin chains
Enter fullscreen mode Exit fullscreen mode

The detection pattern:

UMAP Plot - Before Hemoglobin Filtering:

        โ”‚  ๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต          โ† T cells (good)
        โ”‚  ๐Ÿ”ต๐Ÿ”ต
        โ”‚
        โ”‚        ๐ŸŸข๐ŸŸข      โ† B cells (good)
        โ”‚        ๐ŸŸข๐ŸŸข
        โ”‚
        โ”‚              โŒโŒโŒ  โ† Mystery cluster
        โ”‚              โŒโŒโŒ     (HIGH HBA/HBB!)
        โ”‚              โŒโŒโŒ
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€

Gene Expression in Mystery Cluster:
โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘ Gene      | % Expressed | Avg Exp โ•‘
โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
โ•‘ HBA1      |    95%      |  9.2    โ•‘  ๐Ÿšจ
โ•‘ HBA2      |    92%      |  8.8    โ•‘  ๐Ÿšจ
โ•‘ HBB       |    89%      |  8.1    โ•‘  ๐Ÿšจ
โ•‘ CD3D      |     1%      |  0.1    โ•‘  (no T markers)
โ•‘ CD79A     |     2%      |  0.1    โ•‘  (no B markers)
โ•‘ CD14      |     3%      |  0.2    โ•‘  (no myeloid)
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Conclusion: Red blood cell contamination or ambient RNA!
Enter fullscreen mode Exit fullscreen mode

Ambient RNA problem:

What is "ambient RNA"?

During scRNA-seq:

  Droplet with healthy cell:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  ๐Ÿ”ต         โ”‚ โ† Immune cell
  โ”‚             โ”‚
  โ”‚  Clean!     โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

  Droplet with contamination:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  ๐Ÿ”ต  ๐Ÿ’ง๐Ÿ’ง   โ”‚ โ† Cell + hemoglobin
  โ”‚      ๐Ÿ’ง๐Ÿ’ง   โ”‚    floating in solution
  โ”‚  Problem!   โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

The hemoglobin RNA gets captured along with the cell's RNA!
Enter fullscreen mode Exit fullscreen mode

Filtering strategy:

# Mark hemoglobin genes
fresh_blood_adata.var['HB'] = fresh_blood_adata.var_names.str.contains('^HB[^(P)]')

# Remove cells with >0.1% hemoglobin
adata = adata[adata.obs['pct_counts_HB'] < 0.1]
Enter fullscreen mode Exit fullscreen mode

Why 0.1% threshold?

Hemoglobin Content Distribution:

Cellsโ”‚     Clean cells    Contaminated
     โ”‚                    
     โ”‚ โ•ฑโ•ฒ                 
     โ”‚โ•ฑ  โ•ฒ                โ—  โ—
     โ”‚    โ•ฒ___           โ— โ— โ—
     โ”‚        โ•ฒ_________โ—______โ—___
     0%  0.05% 0.1%  0.5%  1%  5%  โ†’
               โ†‘
            0.1% cutoff
        Keep โœ“    Remove โŒ

Clean immune cells: <0.05% hemoglobin
Contaminated cells: >0.1% hemoglobin
Pure RBC debris: >1% hemoglobin
Enter fullscreen mode Exit fullscreen mode

Real-world impact:

CASE STUDY: The Phantom Cluster

Initial Analysis (no HB filtering):
  15 clusters detected
  Cluster 12: ~400 cells

Cluster 12 characteristics:
  โœ— No T cell markers
  โœ— No B cell markers  
  โœ— No NK markers
  โœ— No monocyte markers
  โœ“ High HBA1, HBA2, HBB

Gene Ontology Enrichment:
  โ†’ "oxygen transport"
  โ†’ "hemoglobin complex"
  โ†’ "erythrocyte differentiation"

Conclusion: Not a real immune cell type!

After HB filtering:
  14 clusters detected โœ“
  Cluster 12: REMOVED
  Clean analysis of actual immune cells!
Enter fullscreen mode Exit fullscreen mode

The triple filter visualization:

Quality Control Cascade:

START: 10,000 cells
  โ”‚
  โ”œโ”€ Filter MT > 20% โ”€โ”€โ”€โ”€โ†’ Remove 800 dying cells
  โ”‚
  โ”œโ”€ Filter RIBO > 2% โ”€โ”€โ”€โ†’ Remove 500 noisy cells
  โ”‚
  โ”œโ”€ Filter HB > 0.1% โ”€โ”€โ”€โ†’ Remove 400 contaminated cells
  โ”‚
  โ†“
END: 8,300 high-quality cells โœ“

Cell Quality Score Formula:
Q = -wโ‚(MT%) - wโ‚‚(RIBO%) - wโ‚ƒ(HB%) + wโ‚„(n_genes)

Where:
  wโ‚ = 2.0  (MT% most important - indicates death)
  wโ‚‚ = 1.0  (RIBO% moderate - indicates noise)
  wโ‚ƒ = 1.5  (HB% high importance - indicates contamination)
  wโ‚„ = 0.5  (n_genes - more genes = better)
Enter fullscreen mode Exit fullscreen mode

Biological interpretation:

Why each filter matters for cell type identification:

WITHOUT FILTERS:                WITH FILTERS:

  "T cell" cluster               True T cell cluster
  Mixed signals:                 Clear signal:
  โ”œโ”€ CD3D (real T marker)       โ”œโ”€ CD3D โœ“
  โ”œโ”€ HBA1 (contamination)       โ”œโ”€ CD3E โœ“
  โ”œโ”€ RPL3 (noise)               โ”œโ”€ CD4 โœ“
  โ””โ”€ MT-CO1 (dying cells)       โ””โ”€ IL7R โœ“

  Confusing! โœ—                  Perfect! โœ“
Enter fullscreen mode Exit fullscreen mode

The Quality Control Thresholds: Finding the Sweet Spot

After identifying these red flags, we need to decide where to draw the line. This is like airport security deciding how suspicious someone needs to be before additional screening.

# Set our quality control thresholds
fresh_blood_adata = fresh_blood_adata[
     (fresh_blood_adata.obs['pct_counts_MT'] < 20) &
     (fresh_blood_adata.obs['pct_counts_RIBO'] < 2) &
     (fresh_blood_adata.obs['pct_counts_HB'] < 0.1) &
     (fresh_blood_adata.obs['n_genes_by_counts'] > 200) &
     (fresh_blood_adata.obs['n_genes_by_counts'] < 6000)
]
Enter fullscreen mode Exit fullscreen mode

Visual representation of our filtering criteria:

                         CELL QUALITY SPECTRUM

     โŒ TOO LOW                    โœ… JUST RIGHT                    โŒ TOO HIGH

     <200 genes              200-6000 genes detected           >6000 genes
     (empty/broken)          (healthy cells)                   (doublets)

     โ”Š                       โ”Š                                 โ”Š
     โ–“โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–‘โ–“
     โ”Š                       โ”Š                                 โ”Š

     Mitochondrial RNA %:

     โŒ >20%                 โœ… <20%
     (dying cell)            (healthy cell)

     ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด     ๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต๐Ÿ”ด๐Ÿ”ด
     (80% MT-bad)            (20% MT-good)
Enter fullscreen mode Exit fullscreen mode

The Filtering Impact: Before and After

Here's what happens to our dataset:

๐Ÿ“Š BEFORE FILTERING:
โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘  Total cells: 10,000           โ•‘
โ•‘  Mean genes/cell: 2,847        โ•‘
โ•‘  Median MT%: 4.2%              โ•‘
โ•‘  Suspected doublets: ~500      โ•‘
โ•‘  Low quality: ~800             โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

           โ†“ APPLY FILTERS โ†“

๐Ÿ“Š AFTER FILTERING:
โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘  Total cells: 8,200 โœ“          โ•‘
โ•‘  Mean genes/cell: 3,124 โœ“      โ•‘
โ•‘  Median MT%: 3.1% โœ“            โ•‘
โ•‘  Clean, high-quality data! ๐ŸŽ‰  โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Enter fullscreen mode Exit fullscreen mode

What we removed:

  • ~500 doublets (two cells masquerading as one)
  • ~800 low-quality/dying cells (high MT%, low gene counts)
  • ~500 contaminated cells (high hemoglobin)

Why this matters: Garbage in, garbage out. If we include dying cells in our T cell cluster, we might falsely conclude that T cells express death-associated genes when they actually don't!

Interactive Visualization: The Filtering Dashboard

Imagine looking at a scatter plot where each point is a cell:

          Mitochondrial % vs Genes Detected

     MT%
     30โ”‚                           โŒ โ† dying cells
       โ”‚                    โŒ  โŒ
     25โ”‚              โŒ โŒ     โŒ
       โ”‚          โŒ          โŒ
     20โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Filter threshold
       โ”‚    โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“
     15โ”‚  โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“
       โ”‚ โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“     โ† healthy cells
     10โ”‚โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“
       โ”‚โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“
      5โ”‚โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“
       โ”‚โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“โœ“
      0โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Genes detected
       0   1k  2k  3k  4k  5k  6k  7k
                                           โ†‘
                                   doublets โŒ
Enter fullscreen mode Exit fullscreen mode

The beauty of this scatter plot:

  • Bottom left: Empty droplets/debris (low genes, low MT)
  • Middle band: Healthy cells (moderate genes, low MT) โœ“
  • Top right: Dying cells (variable genes, high MT)
  • Far right: Doublets (too many genes)

Gene-Level Filtering: Removing Uninformative Genes

We also filter genes, not just cells:

# Remove genes detected in fewer than 3 cells
sc.pp.filter_genes(fresh_blood_adata, min_cells=3)
Enter fullscreen mode Exit fullscreen mode

Why remove rare genes?

Gene X detected in 2 cells out of 8,200:

Scenario 1: Technical noise
  Cell_0042: "X" โ† Sequencing error
  Cell_5891: "X" โ† Random spurious detection

Scenario 2: Extremely rare (can't analyze statistically)
  Even if real, 2 cells is insufficient for robust analysis
Enter fullscreen mode Exit fullscreen mode

The filtering cascade:

          GENE FILTERING CASCADE

Start: 20,000 genes
  โ†“
Remove genes in <3 cells
  โ†“ (remove ~5,000 rare/noise genes)
  โ†“
15,000 genes remain
  โ†“
Keep highly variable genes (later step)
  โ†“ (select ~2,000-3,000 informative genes)
  โ†“
Final: 2,000 genes for analysis

This is like going from every word in the dictionary
to just the keywords that define each cell type!
Enter fullscreen mode Exit fullscreen mode

The Scrublet Score Distribution

When we run doublet detection, we get a score for each cell:

     Doublet Score Distribution

Cellsโ”‚     Real cells        Doublets
      โ”‚        โ•ฑโ•ฒ               โ•ฑโ•ฒ
      โ”‚       โ•ฑ  โ•ฒ             โ•ฑ  โ•ฒ
      โ”‚      โ•ฑ    โ•ฒ___________โ•ฑ    โ•ฒ
      โ”‚     โ•ฑ                       โ•ฒ
      โ”‚____โ•ฑ_________________________โ•ฒ___
      0.0  0.1  0.2  0.3  0.4  0.5  0.6
                 Doublet Score โ†’

      โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”คโ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
       Keep โœ“    Remove (threshold) โŒ
Enter fullscreen mode Exit fullscreen mode

Reading this plot:

  • Left peak: Single cells (low doublet scores) โœ“
  • Right peak/tail: Likely doublets (high scores) โŒ
  • Threshold (typically 0.25-0.30): Decision boundary

The Complete Quality Control Workflow Visualized

                         QC WORKFLOW

     ๐Ÿ“ฅ Raw Data
     10,000 cells
           โ”‚
           โ”œโ”€โ”€โ†’ [Calculate QC metrics]
           โ”‚    โ€ข Count genes per cell
           โ”‚    โ€ข Calculate MT%
           โ”‚    โ€ข Calculate RIBO%
           โ”‚    โ€ข Calculate HB%
           โ”‚
           โ”œโ”€โ”€โ†’ [Detect doublets]
           โ”‚    โ€ข Run Scrublet
           โ”‚    โ€ข Flag suspicious cells
           โ”‚
           โ”œโ”€โ”€โ†’ [Visualize distributions]
           โ”‚    โ€ข Violin plots
           โ”‚    โ€ข Scatter plots
           โ”‚    โ€ข Histograms
           โ”‚
           โ”œโ”€โ”€โ†’ [Set thresholds]
           โ”‚    โ€ข MT% < 20%
           โ”‚    โ€ข Genes: 200-6000
           โ”‚    โ€ข RIBO% < 2%
           โ”‚    โ€ข HB% < 0.1%
           โ”‚    โ€ข Doublet score < 0.25
           โ”‚
           โ”œโ”€โ”€โ†’ [Apply filters]
           โ”‚    โ€ข Remove 1,800 cells
           โ”‚
           โ†“
     โœ… Clean Data
     8,200 high-quality cells
           โ”‚
           โ†“
     [Continue to normalization...]
Enter fullscreen mode Exit fullscreen mode

Real Example: The Contamination Story

Here's a real scenario from our analysis:

CLUSTER 7: The Mystery Cluster

Initial observation:
  โ€ข 450 cells
  โ€ข High hemoglobin genes (HBA1, HBA2, HBB)
  โ€ข Low immune markers
  โ€ข Suspiciously uniform

Investigation:
  โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
  โ•‘ Gene      | % Expressed | Avg Exp โ•‘
  โ• โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•ฃ
  โ•‘ HBA1      |    92%      |  8.7    โ•‘  โ† Red flag!
  โ•‘ HBA2      |    89%      |  8.2    โ•‘  โ† Red flag!
  โ•‘ HBB       |    87%      |  7.9    โ•‘  โ† Red flag!
  โ•‘ CD3D      |     2%      |  0.1    โ•‘  โ† Should be higher
  โ•‘ CD79A     |     1%      |  0.0    โ•‘  โ† Should be higher
  โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

Conclusion: Red blood cell contamination!

Action: Filter out this cluster entirely

Result: Clean immune cell analysis โœ“
Enter fullscreen mode Exit fullscreen mode

The Mathematical Beauty of QC

Quality control involves elegant statistical reasoning:

For each cell, we calculate a quality score:

Q = wโ‚(genes_normalized) - wโ‚‚(MT_pct) - wโ‚ƒ(doublet_score)

Where:
  genes_normalized = (genes - min) / (max - min)
  MT_pct = mitochondrial percentage
  doublet_score = Scrublet output
  wโ‚, wโ‚‚, wโ‚ƒ = weights (importance of each factor)

High Q score โ†’ Keep cell โœ“
Low Q score โ†’ Remove cell โŒ

Visual representation:

Quality Score Distribution:

      โ”‚        Keep these โ†’  โ•ฑโ•ฒ
Cellsโ”‚                     โ•ฑ  โ•ฒ
      โ”‚                    โ•ฑ    โ•ฒ___
      โ”‚    Remove these โ†’ โ•ฑ          โ•ฒ
      โ”‚___________________โ•ฑ____________โ•ฒ____
      Low                              High
                 Quality Score โ†’
Enter fullscreen mode Exit fullscreen mode

๐Ÿ“ˆ Visualizing Quality Metrics

Here's what our quality control plots reveal:

        Number of Genes Detected
             ๐ŸŽป
        โ”‚    โ•ฑ โ•ฒ
        โ”‚   โ•ฑ   โ•ฒ
   Cellsโ”‚  โ•ฑ     โ•ฒ___
        โ”‚ โ•ฑ          โ•ฒ___
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
           0  2000  4000  6000
Enter fullscreen mode Exit fullscreen mode

This violin plot shows:

  • โœ… Most cells: 2,000-4,000 genes detected
  • โŒ Low-quality cells: <1,000 genes detected
  • โš ๏ธ Possible doublets: >6,000 genes detected

๐ŸŽญ Doublet Detection: Finding Cellular Impostors

sc.pp.scrublet(fresh_blood_adata)
Enter fullscreen mode Exit fullscreen mode

Doublets are when two cells end up in the same droplet. It's like conducting interviews but accidentally interviewing two people at onceโ€”their combined answers make no sense!

How Scrublet works:

  1. Simulates artificial doublets by merging random pairs of real cells
  2. Compares each real cell to these fake doublets
  3. Flags cells that look suspiciously similar to the fakes
Cell A + Cell B = Doublet
  ๐Ÿ”ต    +   ๐ŸŸข   =   ๐Ÿ”ต๐ŸŸข
 (T cell)  (B cell)  (Fake "hybrid" cell)
Enter fullscreen mode Exit fullscreen mode

Part 3: Normalization - Making Apples Comparable to Apples ๐ŸŽ

Different cells captured different amounts of RNAโ€”not because they're biologically different, but due to random technical variation. Some droplets caught "talkative" cells, others caught "quiet" ones.

The problem: Cell A might have 50,000 RNA molecules, Cell B only 5,000. Does this mean Cell A is more active? Not necessarily! Cell A might have just been captured more efficiently.

The solution: Normalize!

# Scale each cell to the same total count
sc.pp.normalize_total(fresh_blood_adata)
# Log-transform to stabilize variance
sc.pp.log1p(fresh_blood_adata)
Enter fullscreen mode Exit fullscreen mode

This is like adjusting volume levels so every musician can be heard equally. Now we can compare their actual melodies, not just their loudness.

Why Log Transformation?

Gene expression follows a power law distributionโ€”a few genes are massively expressed, most are moderate or low. Log transformation compresses this range so we can see patterns in both highly and lowly expressed genes.

Before log:
Gene A: โ–
Gene B: โ–
Gene C: โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“โ–“  (drowns out everything!)

After log:
Gene A: โ–
Gene B: โ–โ–
Gene C: โ–โ–โ–โ–  (all visible!)
Enter fullscreen mode Exit fullscreen mode

Part 4: Feature Selection - Finding the Genes That Matter ๐ŸŽฏ

Out of 20,000 genes, which ones actually help us distinguish cell types?

Most genes are "housekeeping genes"โ€”genes that every cell needs to survive, like the genes for making ribosomes or processing energy. These genes are like everyone in the city wearing pantsโ€”sure, it's universal, but it doesn't tell you whether someone is a doctor, teacher, or artist.

We want the distinctive genesโ€”the ones that make cell types unique:

  • CD3D, CD3E โ†’ T cell markers (like a doctor's stethoscope)
  • CD79A, CD79B โ†’ B cell markers (like a teacher's chalk)
  • NKG7, GNLY โ†’ NK cell markers (like a soldier's uniform)
sc.pp.highly_variable_genes(fresh_blood_adata, n_top_genes=1000)
Enter fullscreen mode Exit fullscreen mode

This selects the top 1,000 most variable genesโ€”genes that change dramatically between cells, indicating biological differences rather than random noise.

๐Ÿ“Š Highly Variable Genes Plot:

   Dispersion
      โ†‘
   โ—  โ”‚  โ—  Selected genes
   โ—  โ”‚ โ—โ— โ—  (high variance)
 โ—    โ”‚ โ— โ—
  โ— โ— โ”‚
โ”€โ”€โ”€โ”€โ—โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ Mean Expression
      โ”‚ โ—‹ โ—‹ โ—‹  Not selected
      โ”‚โ—‹  โ—‹    (low variance)
Enter fullscreen mode Exit fullscreen mode

Part 5: Dimensionality Reduction - From 1,000 Dimensions to 2D ๐Ÿ“

We now have 1,000 genes per cell. That's 1,000 dimensionsโ€”impossible to visualize! Imagine trying to draw a 1,000-dimensional graph. Your brain would melt.

Enter PCA (Principal Component Analysis)

PCA is like finding the best camera angles to photograph a complex 3D sculpture. Instead of taking 1,000 photos from random angles, we find the 30-50 angles that show the most interesting features.

sc.tl.pca(fresh_blood_adata)
Enter fullscreen mode Exit fullscreen mode

How PCA works (simplified):

  1. Find the direction with the most variation (PC1)
  2. Find the next direction with the most variation perpendicular to PC1 (PC2)
  3. Continue for 50 components
Original 1,000 dimensions:
๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ... (ร—100)

Compressed to 50 PCs:
๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ๐ŸŽฏ (captures 80%+ of the information!)
Enter fullscreen mode Exit fullscreen mode

UMAP: The Final Visualization

PCA gives us 50 dimensionsโ€”better, but still not visualizable. Enter UMAP (Uniform Manifold Approximation and Projection):

sc.tl.umap(fresh_blood_adata)
Enter fullscreen mode Exit fullscreen mode

UMAP is like taking a complex network map of a city with all its roads, buildings, and connections, and drawing a simplified 2D map that preserves the important relationships:

  • Neighborhoods stay together
  • Connected areas remain connected
  • Distances approximately preserved
         UMAP Plot
    โ•ญโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฎ
    โ”‚  ๐Ÿ”ต๐Ÿ”ต๐Ÿ”ต         โ”‚
    โ”‚  ๐Ÿ”ต๐Ÿ”ต           โ”‚
    โ”‚                โ”‚
    โ”‚        ๐ŸŸข๐ŸŸข๐ŸŸข  โ”‚
    โ”‚        ๐ŸŸข๐ŸŸข    โ”‚
    โ”‚                โ”‚
    โ”‚ ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด          โ”‚
    โ”‚ ๐Ÿ”ด๐Ÿ”ด            โ”‚
    โ•ฐโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฏ

๐Ÿ”ต = T cells
๐ŸŸข = B cells
๐Ÿ”ด = NK cells
Enter fullscreen mode Exit fullscreen mode

Part 6: Clustering - Discovering Cell Types ๐Ÿ”ฌ

Now we can see clear islands of cells in our UMAP plot. But which island is which cell type?

The Leiden Algorithm: Finding Communities

sc.tl.leiden(fresh_blood_adata, resolution=0.5)
Enter fullscreen mode Exit fullscreen mode

Think of your cells as people at a conference. The Leiden algorithm is like observing who talks to whom:

  • People cluster into groups (research fields)
  • Similar people group together naturally
  • Some people bridge multiple groups

How it works:

  1. Build a network where similar cells are connected
  2. Find communities where connections are dense internally but sparse externally
  3. Iteratively optimize these communities
Network Graph:

T cells:        B cells:
 โ—‹โ”€โ—‹โ”€โ—‹          โ—‹โ”€โ—‹
 โ”‚ โ”‚ โ”‚          โ”‚ โ”‚
 โ—‹โ”€โ—‹โ”€โ—‹          โ—‹โ”€โ—‹
   โ†“ (weak connection)
NK cells:
 โ—‹โ”€โ—‹โ”€โ—‹
 โ”‚ โ”‚ โ”‚
 โ—‹โ”€โ—‹โ”€โ—‹
Enter fullscreen mode Exit fullscreen mode

๐Ÿ” Resolution: Choosing the Right Magnification

The resolution parameter is like choosing how close to zoom in:

# Low resolution (0.02): See major cell lineages
sc.tl.leiden(adata, resolution=0.02)  
# โ†’ 3-5 clusters (T cells, B cells, Myeloid)

# Medium resolution (0.5): See cell types
sc.tl.leiden(adata, resolution=0.5)   
# โ†’ 10-15 clusters (CD4+ T, CD8+ T, Memory B, Naive B...)

# High resolution (2.0): See cell subtypes
sc.tl.leiden(adata, resolution=2.0)   
# โ†’ 30+ clusters (Th1, Th2, Th17, Tregs...)
Enter fullscreen mode Exit fullscreen mode

It's like looking at a forest:

  • ๐ŸŒฒ Far away: You see the forest (low resolution)
  • ๐ŸŒณ Closer: You see individual tree species (medium resolution)
  • ๐Ÿƒ Very close: You see oak subspecies and age classes (high resolution)

Part 7: Cell Type Annotation - Putting Names to Faces ๐Ÿท๏ธ

We have clusters, but what ARE they? This is where Decoupler comes in.

Decoupler + PanglaoDB: The Cell Type Detective

markers = dc.op.resource(name="PanglaoDB", organism="human")
dc.mt.ulm(data=fresh_blood_adata, net=markers, tmin=3)
Enter fullscreen mode Exit fullscreen mode

PanglaoDB is a database containing thousands of known cell type markers from published studies. It's like having a library of "wanted posters" for different cell types:

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘      B CELL MARKERS       โ•‘
โ•‘  - CD19  - CD79A          โ•‘
โ•‘  - CD79B - MS4A1          โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•

โ•”โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•—
โ•‘      T CELL MARKERS       โ•‘
โ•‘  - CD3D  - CD3E           โ•‘
โ•‘  - CD3G  - CD4            โ•‘
โ•šโ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•
Enter fullscreen mode Exit fullscreen mode

How Decoupler annotates cells:

  1. For each cell, check how many B cell markers it expresses
  2. Check how many T cell markers it expresses
  3. Check all other cell types
  4. Assign the cell to the type with the highest score
Cell X:
  B cell score:  0.2 โญ
  T cell score:  8.7 โญโญโญโญโญโญโญโญ
  NK cell score: 0.1 โญ

โ†’ Cell X is a T CELL! ๐ŸŽฏ
Enter fullscreen mode Exit fullscreen mode

Statistical Ranking

We go one step further by statistically testing which cell type signature is most enriched in each cluster:

fresh_blood_adata_rank = dc.tl.rankby_group(
    score, 
    groupby="leiden_res0_02", 
    reference="rest", 
    method="t-test_overestim_var"
)
Enter fullscreen mode Exit fullscreen mode

This is like asking: "For each cluster, which cell type signature is significantly higher here than everywhere else?"


Part 8: Validation - Trust, But Verify โœ…

Automated annotation is powerful, but we still need to validate with known marker genes. Let's visualize!

The Dot Plot: Expression at a Glance

sc.pl.dotplot(fresh_blood_adata, marker_genes_dict, "cell_type")
Enter fullscreen mode Exit fullscreen mode

A dot plot shows two things:

  • Dot size: What percentage of cells express this gene?
  • Dot color: How strongly is it expressed?
             CD3D  CD79A  NKG7
T cells       โ—โ—โ—    ยท     ยท
B cells        ยท    โ—โ—โ—    ยท
NK cells       ยท     ยท    โ—โ—โ—

โ— = large, red dot (high expression)
ยท = tiny dot (low/no expression)
Enter fullscreen mode Exit fullscreen mode

If our annotation is correct, we should see:

  • โœ… T cell markers HIGH in T cells, LOW elsewhere
  • โœ… B cell markers HIGH in B cells, LOW elsewhere
  • โœ… NK cell markers HIGH in NK cells, LOW elsewhere

The Violin Plot: Distribution Stories

sc.pl.violin(score, keys=["B cells memory"], groupby="leiden_res0_02")
Enter fullscreen mode Exit fullscreen mode

Violin plots show the full distribution of expression levels:

Cluster 0:  ๐ŸŽป (wide, high median โ†’ B cells!)
Cluster 1:  ๐ŸŽต (narrow, low โ†’ not B cells)
Cluster 2:  ๐ŸŽต (narrow, low โ†’ not B cells)
Enter fullscreen mode Exit fullscreen mode

The Heatmap: The Big Picture

sc.pl.heatmap(fresh_blood_adata, marker_genes_dict, groupby="leiden_res0_02")
Enter fullscreen mode Exit fullscreen mode

A heatmap shows everything at once:

           CD3D CD3E CD79A MS4A1 NKG7 GNLY
Cluster 0: ๐ŸŸฆ๐ŸŸฆ ๐ŸŸจ๐ŸŸจ ๐ŸŸฆ๐ŸŸฆ   (T cells)
Cluster 1: ๐ŸŸฆ๐ŸŸฆ ๐ŸŸจ๐ŸŸจ ๐ŸŸฆ๐ŸŸฆ   (B cells)
Cluster 2: ๐ŸŸฆ๐ŸŸฆ ๐ŸŸฆ๐ŸŸฆ ๐ŸŸจ๐ŸŸจ   (NK cells)

๐ŸŸจ = high expression
๐ŸŸฆ = low expression
Enter fullscreen mode Exit fullscreen mode

Each row should show a distinct pattern, like a unique barcode for each cell type.


๐Ÿงฌ The Biology Behind the Code: What We Discovered

After all this analysis, what did we actually learn about these blood cells?

The Major Players in Your Immune System

๐Ÿ”ต T Cells (CD3+): Your adaptive immune system's quarterbacks

  • Coordinate immune responses
  • Kill infected cells
  • Remember past infections
  • Represent ~70% of lymphocytes in blood

๐ŸŸข B Cells (CD19+, CD79A+): Your antibody factories

  • Produce immunoglobulins
  • Create immunological memory
  • Present antigens to T cells
  • ~10-15% of lymphocytes

๐ŸŸก NK Cells (CD56+, NKG7+): Your innate immune system's assassins

  • Kill virus-infected and tumor cells
  • Don't need prior exposure
  • Rapid response team
  • ~5-15% of lymphocytes

๐Ÿ”ด Monocytes (CD14+, CD16+): Your mobile macrophages

  • Engulf pathogens and debris
  • Differentiate into macrophages and dendritic cells
  • Present antigens
  • ~2-10% of white blood cells

Subtypes and Complexity

Within each major type, we find subtypes:

T Cells:

  • CD4+ Helper T cells (orchestrate immune response)
  • CD8+ Killer T cells (destroy infected cells)
  • Regulatory T cells (prevent autoimmunity)
  • Memory T cells (remember past infections)

B Cells:

  • Naive B cells (haven't encountered antigen yet)
  • Memory B cells (remember past infections)
  • Plasma cells (antibody-producing factories)

This heterogeneity is exactly WHY single-cell analysis is revolutionaryโ€”we'd miss all this complexity in bulk sequencing!


๐Ÿ’ป The Code That Makes It Happen

Let me show you the elegant simplicity of the core workflow:

# 1. Load and QC
adata = sc.read_10x_h5(filename)
sc.pp.calculate_qc_metrics(adata, qc_vars=["MT", "RIBO", "HB"], inplace=True)
sc.pp.filter_cells(adata, min_genes=1000)
sc.pp.filter_genes(adata, min_cells=1000)

# 2. Normalize
adata.layers["counts"] = adata.X.copy()  # Save raw counts
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)

# 3. Feature selection
sc.pp.highly_variable_genes(adata, n_top_genes=1000)

# 4. Dimensionality reduction
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# 5. Clustering
sc.tl.leiden(adata, resolution=0.5)

# 6. Cell type annotation
markers = dc.op.resource(name="PanglaoDB", organism="human")
dc.mt.ulm(data=adata, net=markers)

# 7. Visualize
sc.pl.umap(adata, color=["leiden", "cell_type"])
sc.pl.dotplot(adata, marker_genes, "cell_type")
Enter fullscreen mode Exit fullscreen mode

That's it! About 15 lines of code to go from raw data to annotated cell types. Of course, the devil is in the detailsโ€”choosing the right parameters, validating results, interpreting biologyโ€”but the core workflow is remarkably elegant.

๐Ÿ““ Get the Full Code: Check out the complete Jupyter notebook on GitHub with step-by-step explanations and all visualizations!


๐ŸŒŸ Real-World Applications: Why This Matters

1. Cancer Immunotherapy ๐ŸŽ—๏ธ

Single-cell analysis revealed why some patients respond to immunotherapy and others don't:

  • Responders have more "exhausted" T cells that can be reactivated
  • Non-responders lack these cells or have immunosuppressive cells
  • Led to combination therapies targeting specific cell states

2. COVID-19 Severity ๐Ÿฆ 

During the pandemic, scRNA-seq showed:

  • Severe cases had dysregulated interferon responses
  • Certain monocyte subtypes predicted worse outcomes
  • Helped identify therapeutic targets

3. Autoimmune Diseases ๐Ÿ”„

Discovered that:

  • Regulatory T cells are dysfunctional, not absent, in many autoimmune conditions
  • Different autoimmune diseases have distinct cellular signatures
  • Opened new avenues for targeted therapy

4. Developmental Biology ๐Ÿ‘ถ

Mapping how embryonic stem cells become specialized:

  • Identified intermediate cell states
  • Revealed decision points in cell fate
  • Guided strategies for lab-grown tissues

๐Ÿš€ The Future: Where Are We Headed?

Multi-Omics Integration

Imagine measuring not just RNA, but also:

  • Proteins (what's actually doing the work)
  • Chromatin accessibility (which genes CAN be turned on)
  • DNA methylation (epigenetic memory)
  • TCR/BCR sequences (immune receptor repertoire)

All in the SAME cell. This is happening now!

Spatial Transcriptomics

The next frontier: measuring gene expression while preserving WHERE each cell sits in the tissue:

        Traditional scRNA-seq          Spatial

             โ—โ—โ—โ—โ—โ—                   ๐Ÿ”ต๐Ÿ”ต๐ŸŸข๐Ÿ”ต๐Ÿ”ต
             โ—โ—โ—โ—โ—โ—                   ๐Ÿ”ต๐Ÿ”ต๐ŸŸข๐Ÿ”ต๐Ÿ”ต
        (lost spatial info)          ๐ŸŸข๐ŸŸข๐ŸŸข๐ŸŸข๐ŸŸข
                                     ๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด๐Ÿ”ด
                                (preserved positions!)
Enter fullscreen mode Exit fullscreen mode

This reveals:

  • Cell-cell communication networks
  • Tissue architecture
  • Microenvironments
  • Gradients and zones

AI-Powered Analysis

Machine learning models are being trained on millions of cells to:

  • Automatically annotate any cell from any tissue
  • Predict drug responses
  • Identify disease signatures
  • Generate hypotheses

Real-Time Clinical Diagnostics

The dream: rapid scRNA-seq in hospitals:

  1. Patient presents with immune disorder
  2. Draw blood
  3. scRNA-seq analysis in 24 hours
  4. Precise diagnosis and personalized treatment plan

๐Ÿค” The Philosophy: Why Small Things Matter

There's something profound about single-cell analysis. For centuries, we studied populations, averages, bulks. We asked "What does THE cell do?" as if there was one representative of each type.

Single-cell analysis taught us humility. There is no "THE cell." Every cell is an individual with its own history, state, and destiny. The T cell next to its sibling might be:

  • Responding to infection while its neighbor rests
  • Expressing different genes despite identical DNA
  • On a path to memory while its twin approaches exhaustion

This is true for cells, and perhaps, it's true for life. We are not averages. We are distributions. We are heterogeneous. And that heterogeneity is not noiseโ€”it IS the biology.

When we measure 10,000 cells individually, we're not just generating data. We're acknowledging that:

  • Rare cells matter (the 0.1% might save your life)
  • Transitions matter (cells in-between states reveal mechanisms)
  • Variation matters (diversity enables adaptation)

๐ŸŽ“ Getting Started: Your Own Single-Cell Journey

Want to try this yourself? Here's your roadmap:

1. Learn Python Basics (2-4 weeks)

  • Variables, loops, functions
  • Pandas for data manipulation
  • Matplotlib for visualization

2. Understand the Biology (ongoing)

  • Cell biology fundamentals
  • Immunology basics (if analyzing immune cells)
  • Read reviews on scRNA-seq

3. Work Through Tutorials (4-8 weeks)

4. Analyze Public Data (2-3 months)

  • 10x Genomics public datasets
  • Human Cell Atlas
  • GEO database

5. Design Your Own Experiment (โˆž)

  • Define biological question
  • Plan experimental design
  • Collaborate with experts

๐Ÿ“š Resources

Essential Reading:

Datasets to Practice:

Communities:


๐ŸŽผ Final Thoughts: The Symphony of Life

When I first looked at a UMAP plot of 10,000 cells, each dot representing an individual cell, I was struck by the beauty of it. These aren't just data points. Each one is a living entity with:

  • Billions of proteins working in concert
  • Thousands of molecular machines running simultaneously
  • A complete copy of your genome deciding what to become
  • A past (where it came from) and future (what it will become)

And we can listen to each one's unique molecular signatureโ€”the genes it chooses to express at this moment in time.

It's like conducting a 10,000-piece orchestra where:

  • Every musician plays their own score
  • The music changes moment by moment
  • Patterns emerge from seeming chaos
  • And somehow, together, they create the symphony of life

That's what single-cell RNA sequencing lets us do: listen to the individual voices while appreciating the collective harmony.

Your immune systemโ€”right now, as you read thisโ€”contains billions of these cells. Each one performing its role, some fighting infections you'll never know you had, some remembering that cold from three years ago, some just quietly waiting for their moment to act.

We're just beginning to learn their stories. And with every cell we sequence, every algorithm we refine, every biological insight we gain, we get a little closer to understanding the fundamental question:

What does it mean to be alive, one cell at a time?


๐Ÿ™ Acknowledgments & Further Exploration

This analysis was performed using:

  • HackBio Intership Analysis
  • Scanpy (Wolf et al., 2018)
  • AnnData data structure
  • Decoupler (Badia-i-Mompel et al., 2022)
  • PanglaoDB marker database
  • 10x Genomics technology and datasets

๐Ÿ““ Access the Full Tutorial

The complete Jupyter notebook with all code, visualizations, and detailed explanations is available on GitHub:

View the Human Diseased PBMC Notebook on GitHub

What's included:

  • โœ… Complete analysis pipeline
  • โœ… All code cells with explanations
  • โœ… Quality control visualizations
  • โœ… Clustering and annotation workflows
  • โœ… Interactive plots and heatmaps

๐ŸŒ Connect & Learn

  • ๐Ÿ™ GitHub Repository
  • ๐Ÿ’ฌ Questions? Drop a comment below!
  • ๐Ÿ”” Follow for more bioinformatics tutorials
  • ๐Ÿ“ง Reach out for collaborations

"In the world of single-cell analysis, every cell has a story. Our job is to listen."


Top comments (1)

Collapse
 
geovalexis profile image
Geo

Great post!