DEV Community: Farhan Rehman Sherief

How I redesigned a thermostable enzyme using ProteinMPNN inverse folding - and validated every design with AlphaFold2

Farhan Rehman Sherief — Wed, 01 Jul 2026 21:46:56 +0000

E155 and E215 at 6.03 Å - exactly the nucleophile/acid-base separation
expected for a GH5 retaining endoglucanase.

This step matters. If I had accepted the literature numbering without checking, I would have fixed the wrong residues and the constrained run would be biologically
meaningless.
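
A minimal sketch of this kind of check, reading the fixed-width ATOM records directly. The function names are hypothetical, and measuring between the OE1/OE2 carboxylate oxygens is an assumption: the exact atoms behind the 6.03 Å figure are not stated here.

```python
import math

def glu_carboxylate_oxygens(pdb_lines, chain, resseq):
    """Collect (x, y, z) for the OE1/OE2 atoms of one glutamate residue."""
    coords = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed-width PDB columns: atom name 13-16, residue name 18-20,
        # chain 22, residue number 23-26, coordinates 31-54.
        if line[21] != chain or line[17:20] != "GLU":
            continue
        if int(line[22:26]) != resseq:
            continue
        if line[12:16].strip() in ("OE1", "OE2"):
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords

def min_distance(atoms_a, atoms_b):
    """Smallest pairwise distance (in Å) between two sets of atoms."""
    return min(math.dist(p, q) for p in atoms_a for q in atoms_b)
```

Feeding it the lines of cel5a_clean.pdb for residues 155 and 215 on chain A would give the separation. An empty list from glu_carboxylate_oxygens means the residue at that number is not a glutamate, which is exactly the numbering mismatch this step is meant to catch.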

Step 2: B-factor profile

Before running ProteinMPNN, I computed per-residue B-factors to identify
flexibility hotspots - regions that might benefit most from redesign:

High-flexibility regions (B > 30 Å²):

Residues 246–250: B-factor up to 96.85 Å² - extremely mobile surface loop
N-terminus (residues 1–3)
Loop 169–173

These are the regions where ProteinMPNN has the most freedom and where
thermostabilising mutations would be most impactful in a follow-up study.
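
As a rough illustration, a per-residue B-factor profile can be pulled from the PDB file with nothing but the standard library. This is a sketch under assumptions: the function names are hypothetical, and averaging over all atoms of a residue is my choice here (a Cα-only profile is an equally common convention).

```python
from collections import defaultdict

def mean_bfactor_per_residue(pdb_lines, chain="A"):
    """Average the B-factor (PDB columns 61-66) over the atoms of each residue."""
    values = defaultdict(list)
    for line in pdb_lines:
        if line.startswith("ATOM") and line[21] == chain:
            values[int(line[22:26])].append(float(line[60:66]))
    return {res: sum(b) / len(b) for res, b in sorted(values.items())}

def flexible_residues(profile, cutoff=30.0):
    """Residue numbers whose mean B-factor exceeds the cutoff (in Å²)."""
    return [res for res, b in profile.items() if b > cutoff]
```

Runs of consecutive residue numbers in the output of flexible_residues correspond to loops like the ones listed above.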

Step 3: ProteinMPNN - unconstrained run

First I ran ProteinMPNN without any constraints to understand what sequences it naturally prefers for this backbone:

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --out_folder mpnn_output/temp_0.1 \
    --num_seq_per_target 100 \
    --sampling_temp 0.1 \
    --seed 42

Five temperatures (0.1, 0.2, 0.3, 0.5, 0.8), 100 sequences each = 500 total.
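
One way to script the sweep is to build one command per temperature. This is only a sketch: it reuses the flags from the command above, and it assumes the same seed for every temperature and a temp_<T> output folder pattern, neither of which is confirmed beyond the T=0.1 example.

```python
import subprocess

TEMPERATURES = [0.1, 0.2, 0.3, 0.5, 0.8]

def build_mpnn_command(temp, pdb_path="cel5a_clean.pdb", n_seq=100, seed=42):
    """Assemble the argument list for one ProteinMPNN run at a given temperature."""
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", pdb_path,
        "--out_folder", f"mpnn_output/temp_{temp}",  # assumed naming pattern
        "--num_seq_per_target", str(n_seq),
        "--sampling_temp", str(temp),
        "--seed", str(seed),
    ]

commands = [build_mpnn_command(t) for t in TEMPERATURES]

def run_sweep():
    """Run every command in turn; needs protein_mpnn_run.py in the working directory."""
    for cmd in commands:
        subprocess.run(cmd, check=True)
```

Keeping the command construction separate from run_sweep makes it easy to print and inspect the five commands before spending any compute.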

The result was surprising: at T=0.1, ProteinMPNN placed Threonine at E155
(93/100 sequences) and Alanine at E215 (97/100 sequences). Both catalytic glutamates were replaced with non-catalytic residues.

This is not a bug - it's the correct behaviour. ProteinMPNN optimises for backbone fit, not biological function. Threonine and alanine may pack better against the local structure than glutamate, but they cannot perform the retaining mechanism. This finding motivates the constrained run.
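
Tallies like "93/100 threonine" come from counting residues at one position across the designed sequences. A minimal sketch, with hypothetical function names, assuming the sequence index equals the residue number (true only if the chain starts at residue 1 with no gaps) and that the first FASTA record is the native sequence, which is how ProteinMPNN output is usually laid out:

```python
from collections import Counter

def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def residue_counts(sequences, position):
    """Count amino acids at a 1-based position across designed sequences."""
    return Counter(seq[position - 1] for seq in sequences)
```

With the records loaded, skipping the first (native) entry and calling residue_counts(designs, 155) gives the per-residue tally for that position.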

Step 4: ProteinMPNN - catalytic-constrained run

Fix E155 and E215, let everything else vary:

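import json

# Positions to keep fixed, keyed by PDB name, then chain.
# ProteinMPNN reads these as 1-based positions along the chain,
# so confirm they line up with the PDB residue numbers first.
fixed_positions = {
    "cel5a_clean": {
        "A": [155, 215]
    }
}
with open("fixed_positions.jsonl", "w") as f:
    f.write(json.dumps(fixed_positions) + "\n")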

python protein_mpnn_run.py \
    --pdb_path cel5a_clean.pdb \
    --fixed_positions_jsonl fixed_positions.jsonl \
    --num_seq_per_target 100 \
    --sampling_temp 0.1

Results:

Temperature	Mean score	Mean recovery	E155 preserved	E215 preserved
0.1	0.763	52.6%	100%	100%
0.2	0.788	52.1%	100%	100%
0.3	0.830	51.5%	100%	100%
0.5	0.983	49.1%	100%	100%
0.8	1.369	43.1%	100%	100%

100% catalytic preservation at all temperatures. And the score distributions are nearly identical to the unconstrained run - fixing 2 out of 605 residues costs essentially nothing in terms of backbone fit.
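
The recovery and preservation columns can be sanity-checked with two small helpers. This is a sketch with hypothetical names; ProteinMPNN reports its own recovery figure, and the plain all-positions definition below is only an approximation of it.

```python
def sequence_recovery(native, design):
    """Fraction of positions where the design keeps the native residue."""
    if len(native) != len(design):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(native, design))
    return matches / len(native)

def fixed_positions_preserved(native, design, positions):
    """True if every 1-based position in `positions` matches the native residue."""
    return all(native[p - 1] == design[p - 1] for p in positions)
```

Calling fixed_positions_preserved(native, design, [155, 215]) on each design and averaging the results gives the "preserved" percentage for a temperature.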

Step 5: AlphaFold2 validation

Top 20 constrained designs (lowest MPNN score, T=0.1) were folded using
ColabFold on a free T4 GPU:

colabfold_batch \
  top20_constrained_candidates.fasta \
  af2_structures/designs \
  --model-type alphafold2_ptm \
  --num-recycle 3 \
  --msa-mode single_sequence

Also folded the wildtype under identical conditions as a baseline.

Result: 20/20 designs beat wildtype pLDDT.

Design	MPNN score	AF2 pLDDT	ΔpLDDT vs WT
CEL5A_FIXED_013	0.7532	46.04	+12.71
CEL5A_FIXED_012	0.7513	45.69	+12.37
CEL5A_FIXED_014	0.7535	45.61	+12.29
CEL5A_FIXED_010	0.7508	44.48	+11.16
CEL5A_FIXED_004	0.7471	43.72	+10.40

The per-residue pLDDT profiles show the biggest improvements in the catalytic domain region (residues 150–250) - exactly where the constrained redesign introduced the most changes around the fixed glutamates.

A note on absolute pLDDT values

The pLDDT values (33–46) look low compared to typical small protein benchmarks. This is expected for a 605 aa two-domain protein in single-sequence mode - AlphaFold2 relies heavily on co-evolutionary information from MSAs to correctly position domains. Single-sequence mode lacks this signal.

The meaningful comparison is relative pLDDT under identical conditions, not absolute values. Every design predicts better than wildtype under the same constraints.
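
The relative comparison is simple arithmetic. In the sketch below the design values come from the table above, while the wildtype baseline is inferred from it (46.04 - 12.71) rather than quoted directly, so treat it as an assumption; other rows can differ by 0.01 because of rounding.

```python
# Mean pLDDT per design, copied from the results table
designs = {
    "CEL5A_FIXED_013": 46.04,
    "CEL5A_FIXED_012": 45.69,
    "CEL5A_FIXED_014": 45.61,
    "CEL5A_FIXED_010": 44.48,
    "CEL5A_FIXED_004": 43.72,
}

wt_plddt = 33.33  # assumed: inferred from the table, not stated in the post

def delta_plddt(design_plddt, baseline):
    """Difference in mean pLDDT relative to the baseline fold."""
    return round(design_plddt - baseline, 2)

deltas = {name: delta_plddt(p, wt_plddt) for name, p in designs.items()}
all_better = all(d > 0 for d in deltas.values())
```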

What I'd do next

MSA-mode validation for top 5 designs - proper pLDDT with full co-evolutionary information
Rosetta ΔΔG scoring - filter by predicted thermostability change
Molecular dynamics - simulate the top 3 designs at 55°C and 70°C to assess thermal stability of the catalytic dyad geometry (E155/E215)
Experimental validation - express in E. coli, measure Tm by DSF, compare CMC activity at elevated temperatures

Key lessons

1. Verify catalytic residues against the structure, not just the literature.
PDB numbering often differs from publication numbering. The 6 Å distance
criterion is more reliable than assuming the literature values transfer directly.

2. Run unconstrained first.
The unconstrained run revealed that ProteinMPNN actively avoids glutamate at both catalytic positions. Without this finding, the constrained run would lack motivation and the project would have less scientific narrative.

3. Fixing 2/605 residues is essentially free.
Score distributions between constrained and unconstrained runs are almost
identical. You can enforce catalytic function without sacrificing sequence
diversity.

4. Low absolute pLDDT is not failure.
Single-sequence mode for large multi-domain proteins typically yields low absolute pLDDT. Design your analysis around relative comparisons, and always fold the wildtype under the same conditions as your baseline.

GitHub: github.com/Farhan89082/proteinmpnn-cel5a

If you have questions about any stage - especially the catalytic residue
identification or the constrained run setup - drop them in the comments.

How I built a computational AMP screening pipeline: from 24,000 sequences to 47 drug candidates

Farhan Rehman Sherief — Wed, 01 Jul 2026 10:47:02 +0000

Antimicrobial resistance is projected to kill 10 million people annually by 2050 - more than cancer kills today. Antimicrobial peptides (AMPs) are one of the most promising therapeutic classes to fight it. They're cationic, membrane-disrupting, and bacteria struggle to develop resistance against them.

The problem: there are thousands of candidate sequences in databases. How do you identify the ones worth synthesising?

In this post I'll walk through a full computational screening pipeline I built, from raw database download through machine learning activity prediction to ColabFold structure prediction - ending with 47 high-confidence drug candidates.

The full code is on GitHub: amp-colabfold

The pipeline at a glance

Five stages:

Data curation - 24,076 sequences from DRAMP 4.0 down to 12,435 after filtering
Feature engineering - 16 physicochemical features, 0% missing
Activity modelling - gradient boosting classifier, ROC-AUC 0.77
Candidate selection - 131 structured peptides sent to ColabFold
Structure prediction - 69 high-confidence structures, 47 very high confidence

Stage 1: Data curation

I pulled sequences from DRAMP 4.0 (the Data Repository of Antimicrobial Peptides)
using their REST API - no manual downloads needed.

Three filters applied in sequence:

# Length: 10-50 aa (therapeutic window for AMPs)
df = df[df["sequence"].str.len().between(10, 50)]

# Canonical amino acids only - no modified residues
CANONICAL = set("ACDEFGHIKLMNPQRSTVWY")
df = df[df["sequence"].apply(lambda s: set(s).issubset(CANONICAL))]

# 90% identity clustering — Python reimplementation of CD-HIT
# (CD-HIT is Linux-only; I wrote a greedy longest-first algorithm in pure Python)
df_curated = cluster_cdhit_python(df, identity=0.90)

The CD-HIT reimplementation was the most interesting part of this stage.
CD-HIT's greedy algorithm is conceptually simple: sort sequences longest-first, then for each sequence check if it's ≥90% identical to any existing cluster representative. If yes, discard it. If no, it becomes a new representative. I added 3-mer pre-filtering to skip obviously dissimilar pairs, in the spirit of CD-HIT's short-word filter (CD-HIT itself defaults to 5-mers for proteins at this threshold).

Result: 24,076 → 12,435 sequences
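
The greedy procedure described above can be sketched in a few lines. This is not the cluster_cdhit_python function from the repo: the names are hypothetical, and difflib's matching blocks stand in for a proper alignment-based identity, so results will not match CD-HIT exactly.

```python
from difflib import SequenceMatcher

def kmers(seq, k=3):
    """Set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def identity(a, b):
    """Approximate identity: matched characters / length of the shorter sequence."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks) / min(len(a), len(b))

def greedy_cluster(sequences, threshold=0.90, k=3):
    """Greedy longest-first clustering; returns the cluster representatives."""
    representatives = []  # list of (sequence, k-mer set)
    for seq in sorted(sequences, key=len, reverse=True):
        seq_kmers = kmers(seq, k)
        # Pre-filter: only compute identity for pairs sharing at least one k-mer
        redundant = any(
            seq_kmers & rep_kmers and identity(seq, rep) >= threshold
            for rep, rep_kmers in representatives
        )
        if not redundant:
            representatives.append((seq, seq_kmers))
    return [rep for rep, _ in representatives]
```

The k-mer intersection is cheap, so most dissimilar pairs are rejected before the more expensive identity calculation runs.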

Stage 2: Feature engineering

For each of the 12,435 sequences I computed 16 physicochemical features using
the peptides Python package:

features = {
    "net_charge_pH7":           pep.charge(pH=7.0, pKscale="Lehninger"),
    "hydrophobicity_eisenberg":  pep.hydrophobicity(scale="Eisenberg"),
    "hydrophobic_moment":        pep.hydrophobic_moment(window=11, angle=100),
    "boman_index":               pep.boman(),
    "aliphatic_index":           pep.aliphatic_index(),
    "instability_index":         pep.instability_index(),
    # + Chou-Fasman helix/sheet/turn propensities computed manually
    # + aromaticity, fraction_positive, fraction_negative, length, pI, MW
}

One methodological note: aromaticity() was removed from newer versions of
peptides, so I computed it directly from residue frequencies
(F + Y + W) / length — same definition as Biopython, just implemented inline.

Result: 16 features, 0% missing across all 12,435 sequences
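
The inline features are simple residue frequencies. A minimal sketch, assuming K and R count as positive and D and E as negative (whether histidine was included is not stated); the function names mirror the feature names but are otherwise hypothetical.

```python
def aromaticity(seq):
    """Relative frequency of Phe + Trp + Tyr."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)

def fraction_positive(seq):
    """Share of cationic residues (assumed here: Lys and Arg)."""
    return sum(seq.count(aa) for aa in "KR") / len(seq)

def fraction_negative(seq):
    """Share of anionic residues (assumed here: Asp and Glu)."""
    return sum(seq.count(aa) for aa in "DE") / len(seq)
```

Because Stage 1 already guarantees sequences of 10-50 canonical residues, no empty-sequence guard is included.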

Stage 3: Activity modelling

Binary classification: antibacterial (confirmed MIC data) vs
general/natural (uncharacterised or broad labels).

clf = GradientBoostingClassifier(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=3,        # shallow trees → interpretable SHAP values
    subsample=0.8,
    random_state=42,
)

ROC-AUC: 0.77 on a length-decile-stratified holdout set.

The class imbalance (28.9% positive) is real and intentional — confirmed
antibacterial sequences with MIC data are rarer than general database entries.

I didn't oversample, because the model's conservative predictions are
appropriate for a screening tool: better to miss some true positives than
flood the downstream ColabFold stage with false ones.
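
For readers wondering what a length-decile-stratified holdout looks like, here is a dependency-free sketch. The function names, the 20% test fraction, and the seed are assumptions; the actual split in the repo may be done differently (for example with scikit-learn's stratified splitting).

```python
import bisect
import random
from collections import defaultdict

def length_decile(length, sorted_lengths):
    """Decile (0-9) of `length` within the sorted reference lengths."""
    rank = bisect.bisect_left(sorted_lengths, length)
    return min(9, rank * 10 // len(sorted_lengths))

def length_decile_holdout(sequences, test_fraction=0.2, seed=42):
    """Split indices so each length decile contributes proportionally to the test set."""
    sorted_lengths = sorted(len(s) for s in sequences)
    strata = defaultdict(list)
    for i, seq in enumerate(sequences):
        strata[length_decile(len(seq), sorted_lengths)].append(i)
    rng = random.Random(seed)
    train, test = [], []
    for decile in sorted(strata):
        members = strata[decile]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)
```

Stratifying by length keeps the holdout from being dominated by short or long peptides, which matters when length is the strongest feature in the model.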

What drives the predictions?

The SHAP beeswarm tells a clean biological story:

Length dominates - longer peptides push positive. This reflects the antibacterial subset containing more full-length defensins and cathelicidins.
fraction_helix - high helix propensity → predicted antibacterial. Consistent with α-helical membrane disruption.
fraction_positive - cationic residues push positive. Textbook electrostatic attraction to anionic bacterial membranes.
fraction_negative - anionic residues suppress prediction. Expected.

This is not a black box. Every prediction is traceable to a physical property with a known biological mechanism. That's the point of using shallow trees with SHAP rather than a deep model.

Stage 4: Candidate selection

From the classifier output I selected peptides with ab_proba ≥ 0.938 and
then applied a low-complexity filter:

def is_low_complexity(seq: str, threshold: float = 0.5) -> bool:
    # Flag homopolymers (KKKKKK, RRRRRR) and tandem repeats
    max_freq = max(seq.count(aa) for aa in set(seq)) / len(seq)
    if max_freq > threshold:
        return True
    for i in range(len(seq) - 2):
        kmer = seq[i:i+3]
        if seq.count(kmer) * 3 / len(seq) > 0.6:
            return True
    return False

This step is important. Homopolymeric sequences (poly-K, poly-R) score high
on fraction_positive and fool the classifier - they're real AMPs but
structurally trivial and not useful for ColabFold. The filter removed 44
low-complexity sequences.

Result: 131 structured candidates (15–44 aa) sent to ColabFold

Stage 5: Structure prediction with ColabFold

I ran ColabFold 1.6.1 on Google Colab (free T4 GPU) in single-sequence mode, which is appropriate for short peptides where MSA depth is limited anyway.

colabfold_batch \
  colabfold_input.fasta \
  colabfold_output/ \
  --model-type alphafold2_ptm \
  --num-recycle 3 \
  --num-models 1 \
  --msa-mode single_sequence

For 131 peptides of 15-44 aa, the full run took ~15 minutes on a T4.

Parsing results locally

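import json

import numpy as np

def parse_scores_json(json_path):
    # Read one ColabFold scores file and summarise its confidence metrics
    with open(json_path) as f:
        data = json.load(f)
    return {
        "mean_plddt": float(np.mean(data["plddt"])),
        "ptm":        float(data["ptm"]),
        "plddt_per_residue": data["plddt"],
    }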

pLDDT distribution

69 of 131 candidates (52.7%) have pLDDT ≥ 70, indicating meaningful structural confidence. For short peptides - which are inherently more disordered than globular proteins - this is a strong result.

Structural confidence vs activity prediction

The scatter shows no strong correlation between classifier confidence and
structural confidence - which is biologically expected. A peptide can be
highly predicted antibacterial (high ab_proba) but disordered in solution
(low pLDDT), because many AMPs adopt structure only upon membrane contact.

Final results

Applying both filters - pLDDT ≥ 90 AND ab_proba ≥ 0.95 - gives
47 very high-confidence candidates:

AMP ID	Length	pLDDT	pTM	ab_proba
AMP_02388	30 aa	97.3	0.54	0.952
AMP_01454	36 aa	97.2	0.60	0.985
AMP_02689	28 aa	97.2	0.50	0.954
AMP_01282	37 aa	97.1	0.61	0.981
AMP_01129	38 aa	97.0	0.62	0.959

Full list: candidate_amps.csv
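
The final selection is a two-condition filter. A minimal sketch using the five rows from the table above; the dictionary keys and function name are hypothetical, and the real pipeline presumably applies the same logic to the full results table.

```python
# Top rows from the table above
candidates = [
    {"amp_id": "AMP_02388", "length": 30, "plddt": 97.3, "ptm": 0.54, "ab_proba": 0.952},
    {"amp_id": "AMP_01454", "length": 36, "plddt": 97.2, "ptm": 0.60, "ab_proba": 0.985},
    {"amp_id": "AMP_02689", "length": 28, "plddt": 97.2, "ptm": 0.50, "ab_proba": 0.954},
    {"amp_id": "AMP_01282", "length": 37, "plddt": 97.1, "ptm": 0.61, "ab_proba": 0.981},
    {"amp_id": "AMP_01129", "length": 38, "plddt": 97.0, "ptm": 0.62, "ab_proba": 0.959},
]

def passes_filters(row, min_plddt=90.0, min_proba=0.95):
    """Keep a candidate only if both structure and activity confidence are high."""
    return row["plddt"] >= min_plddt and row["ab_proba"] >= min_proba

selected = [row["amp_id"] for row in candidates if passes_filters(row)]
```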

What I'd do next

Hemolysis screening - filter candidates through HemoPI2 to remove cytotoxic sequences before any wet-lab work
ESM-2 embeddings - train a second classifier on sequence embeddings and compare against the physicochemical feature model
MD simulation - for the top 5 candidates, run short molecular dynamics to assess membrane-binding stability
Experimental validation - MIC assays against E. coli and S. aureus

Key lessons

1. Reproducibility over convenience. The CD-HIT reimplementation in pure Python adds ~100 lines but means the entire pipeline runs on any OS without binary dependencies.

2. Class imbalance is a feature, not a bug. The roughly 2.5:1 negative:positive
ratio (28.9% positive) reflects reality. Don't oversample your way out of it.

3. Interpret your model before trusting it. The SHAP beeswarm told me
the model was learning real biology (charge, helix propensity) rather than
spurious correlations. Without that check I wouldn't have been confident
sending its predictions to a GPU.

4. pLDDT ≠ activity. Low structural confidence doesn't mean an AMP is
inactive - it may just be disordered until it hits a membrane. Keep both
metrics and let them inform each other.

GitHub: github.com/Farhan89082/amp-colabfold

If you found this useful or have questions about any stage, drop a comment below.

How I Used Python to Analyse 40,000 Human Gut Cells and Uncover What Makes Crohn's Disease Different from Colitis

Farhan Rehman Sherief — Sun, 07 Jun 2026 10:52:39 +0000

A step-by-step walkthrough of my multi-sample single-cell RNA sequencing project - written for anyone curious about how computational biology actually works

Before we start - what is this article actually about?

Imagine being able to take a tiny sample of tissue from a patient's gut, and instead of just knowing "there are cells here", you could read the activity of every single gene inside every single individual cell - thousands of cells at once.

That's what single-cell RNA sequencing (scRNA-seq) does. And in this article, I'll walk you through how I used Python to analyse data from 40,000 human gut cells across 18 patients to understand what makes two similar gut diseases - Crohn's disease and ulcerative colitis - biologically different from each other.

No biology PhD required. I'll explain every term as we go.

The biological problem - two diseases that look the same but aren't

Crohn's disease (CD) and ulcerative colitis (UC) are both types of Inflammatory Bowel Disease (IBD). Both cause chronic inflammation in the gut, both cause pain and discomfort, and both are lifelong conditions. Doctors have known for decades that they're different diseases, but at a surface level they're easy to confuse.

The key difference is where and how they inflame:

Crohn's disease can affect any part of the digestive tract, tends to cause deep, patchy inflammation, and is driven largely by a type of immune cell called a myeloid cell (think macrophages - the "pac-man" cells of the immune system)
Ulcerative colitis affects only the colon and rectum, causes continuous surface inflammation, and is driven more by antibody-producing cells called plasma cells

Understanding these differences at the level of individual cells - not just tissue - is critical for developing better, more targeted treatments. That's where single-cell sequencing comes in.

The data - 18 patients, 40,000 cells, one big challenge

I used a publicly available dataset from the CZ CELLxGENE platform containing colonic mucosa (colon lining) biopsies from 18 donors:

6 patients with Crohn's disease
6 patients with ulcerative colitis
6 healthy controls

Each donor contributed thousands of cells, giving us 46,700 cells total (40,084 after quality filtering).

But here's the problem: when you combine data from 18 different people, collected at different times, processed slightly differently in the lab - the data gets messy. The technical differences between samples (called batch effects) can be so strong that they hide the real biological differences you actually care about.

Think of it like trying to compare photos taken by 18 different cameras with different colour settings. The same object might look completely different just because of the camera, not because the object actually changed.

This is where Harmony comes in.

What is Harmony and why does it matter?

Harmony is a batch correction algorithm. In plain English: it's a mathematical tool that looks at all 18 samples, figures out which differences between samples are due to technical variation (the "camera settings"), and removes those differences — leaving only the real biological signal.

Here's what the data looks like before Harmony:

PCA plot coloured by donor - you can see individual donors clustering separately, meaning donor identity is driving the structure more than the actual biology.

And after Harmony:

UMAP plot coloured by donor - all 18 donors are now mixed uniformly within each cell type cluster. The batch effects are gone.

But crucially - when we colour the same post-Harmony UMAP by disease group instead of donor:

UMAP coloured by disease - Crohn's, UC, and normal cells now separate into distinct regions. The biology is preserved, the noise is removed.

This before/after comparison is the core technical contribution of this project. Without Harmony, any findings could be explained by "oh, that's just because donor 3 was processed differently." With Harmony, we can be confident the differences are real.

Setting up the pipeline in Python

The full analysis uses Scanpy - the standard Python library for single-cell analysis - along with harmonypy for the integration step.

import scanpy as sc
import harmonypy

# Load the dataset
adata = sc.read_h5ad('data/ibd_dataset.h5ad')
# 46,700 cells × 32,354 genes

The data is stored in an AnnData object - think of it as a very specialised spreadsheet where rows are cells, columns are genes, and there's extra space to store metadata like disease status, donor ID, and cell type labels.

Step 1 - Quality control

Not every cell in a sequencing experiment is a real, healthy cell. Some are damaged, some are empty droplets that got accidentally captured, and some are doublets (two cells mistakenly counted as one). We filter these out using three thresholds on two metrics:

# Assumes the QC metrics were computed first, e.g. with
# sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True)
# Remove low quality cells
adata = adata[adata.obs['n_genes_by_counts'] > 200]    # too few genes = empty droplet
adata = adata[adata.obs['n_genes_by_counts'] < 6000]   # too many genes = doublet
adata = adata[adata.obs['pct_counts_mt'] < 30].copy()  # high mitochondrial % = dying cell

Why mitochondrial genes? When a cell is dying or its outer membrane is damaged, the RNA floating in the cytoplasm leaks out - but the RNA inside the mitochondria (the cell's energy factories, which have their own separate DNA and RNA) stays enclosed and is retained. So a high percentage of mitochondrial gene reads is a reliable sign of a low-quality cell.

After filtering: 40,084 cells remain.

Step 2 - Normalisation

Different cells capture different amounts of RNA simply due to technical variation in the sequencing process. A cell with 10,000 RNA molecules captured will look like it expresses every gene more than a cell with only 2,000 captured — even if their true biology is identical.

We normalise by scaling every cell to have the same total count (10,000), then apply a log transformation to reduce the influence of very highly expressed genes:

sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
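
To see what these two calls do to a single cell, here is a plain-Python sketch of the same arithmetic. The function name and the toy counts are made up for illustration; Scanpy does this on the whole matrix at once.

```python
import math

def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts to a fixed total, then apply log(1 + x)."""
    total = sum(counts)
    return [math.log1p(c * target_sum / total) for c in counts]

# Two hypothetical cells with identical gene proportions but different depth
deep_cell = [5000, 3000, 2000]     # 10,000 molecules captured
shallow_cell = [1000, 600, 400]    # 2,000 molecules captured

deep_norm = normalize_log1p(deep_cell)
shallow_norm = normalize_log1p(shallow_cell)
```

After normalisation the two cells have the same values, so the difference in sequencing depth no longer looks like a difference in biology.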

Step 3 - Finding highly variable genes

Out of 32,354 genes, most are either not expressed at all or expressed at the same level in every cell (housekeeping genes that keep basic cell functions running). These genes add noise without adding information.

We select only the highly variable genes - genes whose expression varies meaningfully between cells - for downstream analysis. We found 2,873 of these, using batch_key='donor_id' to ensure we pick genes that are variable across all donors, not just one:

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3,
                             min_disp=0.5, batch_key='donor_id')
# Result: 2,873 highly variable genes

Step 4 - PCA and Harmony integration

PCA (Principal Component Analysis) reduces our 2,873-gene matrix into 50 dimensions that capture the most important variation. Then Harmony corrects for batch effects within this PCA space:

sc.tl.pca(adata, svd_solver='arpack')
sc.external.pp.harmony_integrate(adata, key='donor_id', max_iter_harmony=20)
# Harmony converged in just 6 iterations

The fact that Harmony converged in only 6 out of 20 possible iterations is a reasonable sign - it suggests the batch effects were relatively structured and correctable. Convergence alone says nothing about whether the biology was preserved, which is why the disease-coloured UMAP above matters.

Step 5 - UMAP visualisation

UMAP (Uniform Manifold Approximation and Projection) takes the Harmony-corrected embeddings and projects them into 2D for visualisation. Similar cells end up close together, different cells end up far apart:

sc.pp.neighbors(adata, use_rep='X_pca_harmony', n_neighbors=10)
sc.tl.umap(adata)

What we found - the biology

Finding 1: Five major cell populations in the gut

The UMAP revealed five well-separated clusters corresponding to the major cell types in the colon (the counts below sum to 46,700, so they refer to the full dataset before QC filtering):

Plasma cells (15,633 cells) - antibody factories
Colon epithelial cells (12,347 cells) - the cells lining the gut wall
T cells (12,128 cells) - immune soldiers
Myeloid cells (3,771 cells) - macrophages and related immune cells
Stromal cells (2,821 cells) - structural support cells

Finding 2: UC and CD have completely different cellular makeups

This is where the biology gets interesting. When we look at what proportion of cells belong to each type across disease groups:

Plasma cells:

Normal: ~27%
Crohn's disease: ~30%
Ulcerative colitis: ~52%

UC has nearly double the plasma cell proportion of healthy tissue. Plasma cells make antibodies - this is consistent with a strong antibody-mediated (humoral) component in UC, a well-established finding that we independently reproduced from raw data.

Myeloid cells:

Normal: ~6%
Ulcerative colitis: ~8%
Crohn's disease: ~13%

CD has more than double the myeloid cell proportion. Myeloid cells include macrophages - the cells responsible for the granulomatous (nodule-forming) inflammation that's the hallmark of Crohn's disease.

Epithelial cells:

Normal: ~29%
Crohn's disease: ~19%
Ulcerative colitis: ~3%

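This is striking. In these biopsies, epithelial cells make up only about 3% of the cells captured from UC patients, which points to a severely disrupted gut lining. Proportions are relative, so part of the drop also reflects the expanded plasma cell fraction. Either way, it fits the pronounced mucosal damage and bleeding seen in UC.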
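
Proportions like these come from counting cell types within each disease group. A minimal sketch with a hypothetical function name and toy labels; on the real data the two label lists would come from columns of adata.obs (assumed here to be named disease and cell_type).

```python
from collections import Counter, defaultdict

def cell_type_proportions(disease_labels, cell_type_labels):
    """Fraction of each cell type within each disease group."""
    counts = defaultdict(Counter)
    for disease, cell_type in zip(disease_labels, cell_type_labels):
        counts[disease][cell_type] += 1
    return {
        disease: {ct: n / sum(c.values()) for ct, n in c.items()}
        for disease, c in counts.items()
    }

# Toy labels for illustration only, not real data
diseases = ["UC", "UC", "UC", "UC", "CD", "CD"]
cell_types = ["plasma", "plasma", "T", "plasma", "myeloid", "plasma"]
proportions = cell_type_proportions(diseases, cell_types)
```

Normalising within each disease group, rather than across the whole dataset, is what makes groups with different cell numbers comparable.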

Finding 3: The S100A8/S100A9 signature - a clinical biomarker reproduced at single-cell resolution

To find which specific genes are driving the myeloid difference in Crohn's disease, we ran a differential expression analysis comparing CD myeloid cells against normal myeloid cells using the Wilcoxon rank-sum test.

The top upregulated genes were:

Gene	What it does
S100A8	Subunit of calprotectin - a protein released by activated immune cells
S100A9	The other subunit of calprotectin
CXCL8	Also known as IL-8 - a chemical signal that recruits more immune cells to the inflammation site
IL1RN	An anti-inflammatory signal - the body trying to dampen its own response
BCL2A1	Keeps immune cells alive longer in the inflamed environment

Here's why S100A8 and S100A9 matter: together they form a protein called calprotectin. When gut inflammation is active, immune cells release calprotectin into the stool. Doctors measure this in a routine test called the faecal calprotectin test - one of the most common non-invasive ways to monitor IBD disease activity in clinic.

By analysing raw single-cell data, we independently identified the exact genes behind this clinical test - at the resolution of individual cells. That's the kind of validation that confirms the analysis is biologically meaningful, not just a statistical artefact.
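
For intuition about the Wilcoxon rank-sum test used here, the sketch below computes the statistic for one gene from two lists of expression values. The function names are hypothetical and the normal approximation omits the tie correction; in practice this is usually delegated to Scanpy's sc.tl.rank_genes_groups with method='wilcoxon', whose implementation details may differ.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def rank_sum_z(group_a, group_b):
    """Return (U, z) for group_a versus group_b (normal approximation, no tie correction)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    u_a = sum(ranks[:n_a]) - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return u_a, (u_a - mean_u) / sd_u
```

A large positive z means the gene tends to rank higher in the first group, which is how genes like S100A8 and S100A9 end up at the top of an upregulated list.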

What I learned - technical takeaways

1. Batch correction is non-negotiable for multi-sample studies. Without Harmony, the donor-to-donor variation would swamp the disease signal. Any "finding" could be explained by technical noise. Harmony is now a standard step in any multi-patient single-cell study.

2. Choosing highly variable genes with batch_key matters. If you find HVGs without accounting for batches, you risk selecting genes that are variable only because of technical differences between samples. Using batch_key='donor_id' ensures you're finding genes that are genuinely biologically variable.

3. Clinical relevance is the best validation. When your computational analysis independently reproduces a well-established clinical biomarker (faecal calprotectin), it gives you confidence that the pipeline is working correctly and the findings are real.

4. Public datasets are powerful. This entire analysis used freely available data from CZ CELLxGENE - no lab access required. The tools (Scanpy, harmonypy) are free and open source. Computational biology has an exceptionally low barrier to entry compared to wet lab science.

The full code

The complete annotated notebook is available on GitHub:
👉 github.com/Farhan89082/ibd-harmony-integration

The README includes all figures, biological interpretation, and instructions for reproducing the analysis from scratch.

What's next?

This project is part of a series of three single-cell RNA sequencing analyses I've built for my computational biology portfolio:

Alzheimer's Disease - microglial activation and mitochondrial dysfunction in human brain cells → github.com/Farhan89082/alzheimers-scrna-analysis
NSCLC Tumour Microenvironment - T cell exhaustion trajectories and macrophage polarisation in lung cancer → github.com/Farhan89082/nsclc-tumour-microenvironment
IBD Harmony Integration - this article

If you're a Python developer curious about getting into computational biology, or a biology student learning to code, I hope this walkthrough shows that the barrier is lower than it looks. The tools are excellent, the data is freely available, and the biology is genuinely fascinating.

Have questions about the analysis or want to discuss the methodology? Drop a comment below - I'd love to hear from you.

T Cells, Tumour Macrophages, and Why Lung Cancer Evades Your Immune System

Farhan Rehman Sherief — Sat, 06 Jun 2026 18:54:14 +0000

One of the most frustrating puzzles in cancer biology: some lung cancer patients respond brilliantly to immunotherapy. Others don't respond at all. The tumour microenvironment (TME), the ecosystem of immune, stromal, and cancer cells that makes up and surrounds a tumour, is a big part of why.

I wanted to understand what that ecosystem actually looks like at the resolution of individual cells. So I built a single-cell RNA sequencing (scRNA-seq) analysis of non-small cell lung cancer (NSCLC) tissue using Scanpy, working with a publicly available dataset of 111,683 cells from CZ CELLxGENE, spanning 38 distinct cell states.

Here's what the data revealed about T cell exhaustion, macrophage behaviour, and what makes some tumours more immune-suppressive than others.

Background

NSCLC accounts for approximately 85% of all lung cancer cases. Immune checkpoint inhibitors have transformed treatment for some patients — but a large proportion still don't respond. The leading explanation is that the TME is actively suppressing anti-tumour immunity, rather than the immune system simply being absent.

To understand this, you need to look not just at whether immune cells are present, but at what functional state those cells are in. That's exactly what single-cell RNA sequencing makes possible.

The Dataset

The data comes from a study characterising the cellular and molecular identities of histologic subtypes in lung adenocarcinoma:

111,683 cells after QC (117,266 raw)
57,398 genes
38 distinct cell states across the TME
4 histologic subtypes: Acinar/Papillary (A/P), A/P + Solid, Micropapillary (MP), and Solid
Sequenced with 10× Chromium 3' v3

The Pipeline

1. Quality Control

Cells were filtered to retain those with 200–8,000 detected genes, fewer than 100,000 UMI counts, and less than 25% mitochondrial content. This removed 5,583 low-quality cells (117,266 → 111,683).

2. Normalisation

Standard scRNA-seq approach: normalise to 10,000 counts per cell, log1p transform, identify 2,544 highly variable genes.

3. Dimensionality Reduction

PCA (50 components) → neighbourhood graph → UMAP. The resulting embedding resolved all 38 cell states, with clear separation between tumour epithelial cells, immune populations, and stromal cells.

The Findings

1. T Cell Exhaustion Is Not an End State - It's a Trajectory

This was the most biologically interesting result. The CD8+ T cell data revealed three functionally distinct states forming a clear exhaustion trajectory:

T.CD8.Naive → T.CD8.Predysfunc (8,897 cells) → T.CD8.Exhausted (1,454 cells)

The pre-dysfunctional population is the largest CD8 group - meaning a large share of the CD8+ T cells in these tumours are actively being pushed toward dysfunction, not yet exhausted but clearly losing their killing capacity.

The fully exhausted population co-expresses multiple exhaustion-associated markers simultaneously: the checkpoint molecules CTLA4, LAG3, and TIGIT, alongside TOX, ENTPD1, and CXCL13. In contrast, cytotoxic T cells express effector molecules (GZMB, PRF1, IFNG) with minimal checkpoint expression. The dotplot makes this contrast clear.

2. Regulatory T Cells Outnumber Cytotoxic T Cells ~4:1

Tregs (6,496 cells) vastly outnumber cytotoxic CD8+ T cells (1,688 cells). Tregs express high levels of CTLA4 and TOX, consistent with their role in actively suppressing anti-tumour immunity. This ratio alone explains a great deal about why these tumours resist immune attack.

3. Pro-Tumour Macrophages Dominate the Myeloid Compartment

Five macrophage subpopulations were identified, including:

Mac.SPP1 and Mac.SPP1.GPNMB (~7,054 cells combined) - pro-tumourigenic, SPP1/osteopontin-expressing macrophages associated with poor prognosis in multiple cancer types
Mac.SELENOP (3,082 cells) - anti-inflammatory, tissue-resident
Mac.CXCL9 (1,118 cells) - anti-tumour, recruits cytotoxic T cells

The pro-tumour to anti-tumour macrophage ratio is approximately 6:1. Combined with the T cell exhaustion data, this paints a picture of a TME that is actively - and efficiently - suppressing immune responses at multiple levels simultaneously.
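
Both ratios quoted in findings 2 and 3 follow directly from the cell counts given above. A quick check, with variable names chosen for illustration:

```python
# Cell counts as reported in the findings above
tregs = 6496
cytotoxic_cd8 = 1688
spp1_macrophages = 7054   # Mac.SPP1 + Mac.SPP1.GPNMB combined
cxcl9_macrophages = 1118  # Mac.CXCL9

treg_to_cytotoxic = tregs / cytotoxic_cd8
pro_to_anti_macrophage = spp1_macrophages / cxcl9_macrophages

print(f"Treg : cytotoxic CD8 = {treg_to_cytotoxic:.1f} : 1")
print(f"Pro- : anti-tumour macrophage = {pro_to_anti_macrophage:.1f} : 1")
```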

4. The Paradox of Solid Tumours

Solid histologic subtype tumours had the highest immune infiltration (~75% immune cells vs ~63% in acinar/papillary). But solid subtype is also the most aggressive histology.

This counterintuitive finding is consistent with what's sometimes called immune suppression rather than immune exclusion: more immune cells arrive, but the suppressive environment is so potent that they can't function. High infiltration doesn't automatically mean effective anti-tumour immunity.

What This Means for Immunotherapy

The data points to a few specific vulnerabilities worth thinking about:

The large pre-dysfunctional T cell population is a potential therapeutic opportunity - these cells aren't fully exhausted yet and might respond to checkpoint blockade
The 6:1 pro-tumour macrophage ratio suggests that myeloid reprogramming strategies (not just T cell-targeting therapies) may be needed
The Treg suppression appears to work partly through CTLA4, which may help explain why anti-CTLA4 therapy has shown activity in some NSCLC settings

What I Learned

Subsetting for focused analysis pays off. Rather than analysing all 38 cell states at once, isolating the T cell and macrophage compartments separately and running marker analysis on those subsets gave much cleaner, more interpretable results.
Using the authors' cell type annotations directly was the right call. For a dataset this complex, re-clustering from scratch would have introduced unnecessary uncertainty. Their labels are validated and biologically grounded.
With 38 cell states and multiple findings, deciding what to prioritise in the write-up took as long as some analysis steps. Storytelling is part of the work.

The Code

Everything is in a fully annotated Jupyter Notebook. Download the H5AD file from CZ CELLxGENE and place it in the data/ folder to run it.

Farhan89082 / nsclc-tumour-microenvironment

Single-cell RNA-seq analysis of the NSCLC tumour microenvironment - T cell exhaustion trajectories and macrophage polarisation in lung adenocarcinoma

The TME is one of the most complex biological systems we can now study at single-cell resolution - and it's increasingly clear that understanding it is key to making immunotherapy work for more patients. Happy to discuss the analysis, the biology, or the pipeline in the comments.

How I Mapped Brain Cell Changes in Alzheimer's Disease Using Single-Cell RNA Sequencing

Farhan Rehman Sherief — Sat, 06 Jun 2026 18:39:29 +0000

Alzheimer's disease affects over 55 million people worldwide, yet the precise molecular changes happening inside individual brain cells remain poorly understood. I wanted to dig into that question - not at the tissue level, but at single-cell resolution.

So I built a full scRNA-seq analysis pipeline in Python using Scanpy, working with a publicly available dataset of 63,608 nuclei from human prefrontal cortex tissue (sourced from CZ CELLxGENE). The donors spanned three Braak stages: 0 (cognitively normal), 2 (early Alzheimer's), and 6 (severe Alzheimer's).

Here's what I found and how I found it.

The Dataset

The data came from a study on the molecular characterisation of selectively vulnerable neurons in AD. It covers the superior frontal gyrus, a prefrontal region known to be hit hard by neurodegeneration, and includes seven major brain cell types:

Glutamatergic neurons
GABAergic neurons
Oligodendrocytes
OPCs (oligodendrocyte precursor cells)
Astrocytes
Microglia
Endothelial cells

31,997 genes. 63,608 cells. Three disease stages. A lot to work with.

The Pipeline

1. Quality Control

No dataset is clean out of the box. I filtered cells to keep only those with between 200 and 6,000 detected genes, and excluded anything with more than 20% mitochondrial gene content (high mitochondrial reads usually signal a dying or damaged cell). This removed around 2,809 low-quality cells.
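The filter itself is just three thresholds applied per cell. Below is a minimal pure-Python sketch of that rule, not the notebook's Scanpy code. The `CellQC` class, the barcodes, and the per-cell values are hypothetical, and the sketch assumes the 200 and 6,000 bounds are inclusive.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    n_genes: int     # number of genes detected in the cell
    pct_mito: float  # percentage of counts from mitochondrial genes


MIN_GENES = 200
MAX_GENES = 6000
MAX_PCT_MITO = 20.0


def passes_qc(cell: CellQC) -> bool:
    """Keep cells with 200-6,000 detected genes and at most 20% mitochondrial counts."""
    return MIN_GENES <= cell.n_genes <= MAX_GENES and cell.pct_mito <= MAX_PCT_MITO


# Hypothetical cells, one per failure mode plus one that passes.
cells = [
    CellQC("cell_ok", 1500, 4.2),
    CellQC("cell_too_few_genes", 120, 3.0),
    CellQC("cell_too_many_genes", 7200, 5.0),
    CellQC("cell_high_mito", 2100, 35.0),
]

kept = [c for c in cells if passes_qc(c)]
removed = len(cells) - len(kept)
print(f"kept {len(kept)} cell(s), removed {removed}")
```

In Scanpy, the per-cell metrics would typically come from `sc.pp.calculate_qc_metrics`, with the thresholds applied by boolean indexing on `adata.obs`.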

2. Normalisation

Library sizes were normalised to 10,000 counts per cell, followed by a log1p transformation - standard practice that makes cells comparable regardless of how deeply they were sequenced. I then identified 5,607 highly variable genes to focus the downstream analysis.
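To see why this makes cells comparable, here is a minimal pure-Python sketch of the same arithmetic on a single cell's count vector. It is an illustration rather than the notebook's code (in Scanpy the equivalent calls are `sc.pp.normalize_total(adata, target_sum=1e4)` followed by `sc.pp.log1p(adata)`). The function name and the toy counts are made up for the example.

```python
import math


def normalise_and_log(counts, target_sum=10_000):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero rather than dividing by zero.
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


# Two toy cells with the same expression profile, sequenced at different depths.
shallow_cell = [1, 3, 0, 6]      # 10 counts in total
deep_cell = [10, 30, 0, 60]      # 100 counts in total

shallow_norm = normalise_and_log(shallow_cell)
deep_norm = normalise_and_log(deep_cell)

# After normalisation the two cells are indistinguishable.
print(shallow_norm)
print(deep_norm)
```

The tenfold difference in sequencing depth disappears, which is the point of the step: what remains is relative expression.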

3. Dimensionality Reduction

PCA (50 components) → neighbourhood graph (10 neighbours, 20 PCs) → UMAP embedding.

The UMAP is where the biology starts to become visible. All seven cell types separated into distinct clusters, with clear separation between neuronal subtypes and glial populations.

4. Differential Expression

For the microglial analysis, I used a Wilcoxon rank-sum test comparing AD vs normal microglia, with Benjamini-Hochberg multiple testing correction to control the false discovery rate.
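The Benjamini-Hochberg step matters because thousands of genes are tested at once. Here is a minimal pure-Python sketch of the adjustment, with made-up p-values. It is not the notebook's code: Scanpy's `sc.tl.rank_genes_groups` applies this correction internally when `corr_method='benjamini-hochberg'`.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotonic.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw = [0.001, 0.008, 0.039, 0.041, 0.6]
adj = benjamini_hochberg(raw)

for p, q in zip(raw, adj):
    flag = "significant" if q < 0.05 else "not significant"
    print(f"raw p = {p:<6} adjusted p = {q:.5f}  ({flag} at FDR 0.05)")
```

In this toy example the genes with raw p-values of 0.039 and 0.041 would pass an uncorrected 0.05 cut-off, but both land at 0.05125 after correction and drop out.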

The Findings

Glutamatergic Neurons Are Selectively Depleted

One of the most striking results: glutamatergic (excitatory) neurons dropped from ~34% of cells in normal tissue to ~30% in AD tissue. This might sound like a small shift, but across 60,000+ cells it is a measurable difference, and it is consistent with what the literature reports about the selective vulnerability of excitatory neurons in AD.
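The comparison above is a per-condition cell-type fraction. Below is a minimal sketch of that calculation using only the standard library. The label lists are toy data built to mirror the reported percentages, not the real annotations, and the function name is hypothetical.

```python
from collections import Counter


def cell_type_fractions(labels):
    """Return the fraction of cells carrying each cell-type label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cell_type: n / total for cell_type, n in counts.items()}


# Toy labels (100 cells per condition) constructed to mirror the ~34% vs ~30% figures.
normal_labels = ["Glutamatergic"] * 34 + ["Other"] * 66
ad_labels = ["Glutamatergic"] * 30 + ["Other"] * 70

normal_frac = cell_type_fractions(normal_labels)["Glutamatergic"]
ad_frac = cell_type_fractions(ad_labels)["Glutamatergic"]

print(f"normal: {normal_frac:.0%}, AD: {ad_frac:.0%}, "
      f"change: {(ad_frac - normal_frac) * 100:+.0f} percentage points")
```

With an AnnData object, the same numbers would usually come from grouping `adata.obs` by disease status and cell type. The column names depend on the dataset.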

Alzheimer's Leaves a Clear Signature in Microglia

Microglia are the brain's resident immune cells, and they showed the most dramatic transcriptomic shifts between AD and normal tissue. The differential expression analysis revealed:

Upregulated in AD microglia:

MALAT1 - a long non-coding RNA strongly linked to neuroinflammation
FTH1 - ferritin heavy chain, pointing to iron dysregulation
B2M - beta-2 microglobulin, a known AD biomarker reflecting immune activation
FOXP1 - a transcription factor tied to microglial activation states

Downregulated in AD microglia:

MT-CO3, MT-CO1, MT-ATP6, MT-ND2 - mitochondrial complex genes, suggesting impaired energy metabolism in AD-affected microglia

This pattern is consistent with what's described as disease-associated microglia (DAM) in the literature, a distinct activation state that emerges in neurodegeneration.

Disease Progression Captured Across Braak Stages

Cells from all three Braak stages were distributed across every cluster in the UMAP. This suggests that AD-associated transcriptomic changes are not confined to one cell type; they appear across the whole cellular ecosystem as the disease progresses.

What I Learned

Memory management matters. 60K+ cells × 30K+ genes is a big matrix. Working with sparse AnnData objects and being deliberate about which steps you checkpoint to disk makes a real difference.
Cell type annotation is an art. The dataset came with pre-annotated cell types, but validating them against canonical marker genes (the dotplot step) is essential and satisfying when the biology confirms itself.
Volcano plots are still one of the most readable ways to communicate differential expression. They give you significance and fold change in one glance.

The Code

Everything is in a fully annotated Jupyter Notebook. If you want to reproduce the analysis, download the H5AD file from CZ CELLxGENE and drop it in the data/ folder.

Farhan89082 / alzheimers-scrna-analysis

Single-cell transcriptomic analysis of Alzheimer's disease using Scanpy - cell-type-specific gene expression in the human prefrontal cortex

If you're working with single-cell data or have questions about the pipeline, I'd love to hear from you in the comments. There's something fascinating about watching biology emerge from a matrix of gene counts.