<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Farhan Rehman Sherief</title>
    <description>The latest articles on DEV Community by Farhan Rehman Sherief (@farhansherief).</description>
    <link>https://dev.to/farhansherief</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3971649%2F4fea112e-39e3-4e68-a18e-7205a7269fc3.jpg</url>
      <title>DEV Community: Farhan Rehman Sherief</title>
      <link>https://dev.to/farhansherief</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/farhansherief"/>
    <language>en</language>
    <item>
      <title>How I Used Python to Analyse 40,000 Human Gut Cells and Uncover What Makes Crohn's Disease Different from Colitis</title>
      <dc:creator>Farhan Rehman Sherief</dc:creator>
      <pubDate>Sun, 07 Jun 2026 10:52:39 +0000</pubDate>
      <link>https://dev.to/farhansherief/how-i-used-python-to-analyse-40000-human-gut-cells-and-uncover-what-makes-crohns-disease-3npd</link>
      <guid>https://dev.to/farhansherief/how-i-used-python-to-analyse-40000-human-gut-cells-and-uncover-what-makes-crohns-disease-3npd</guid>
      <description>&lt;p&gt;&lt;em&gt;A step-by-step walkthrough of my multi-sample single-cell RNA sequencing project - written for anyone curious about how computational biology actually works&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Before we start - what is this article actually about?
&lt;/h2&gt;

&lt;p&gt;Imagine being able to take a tiny sample of tissue from a patient's gut, and instead of just knowing "there are cells here", you could read the activity of every single gene inside every single individual cell - thousands of cells at once.&lt;/p&gt;

&lt;p&gt;That's what single-cell RNA sequencing (scRNA-seq) does. In this article, I'll walk you through how I used Python to analyse data from roughly 40,000 human gut cells across 18 donors to understand what makes two similar gut diseases - Crohn's disease and ulcerative colitis - biologically different from each other.&lt;/p&gt;

&lt;p&gt;No biology PhD required. I'll explain every term as we go.&lt;/p&gt;




&lt;h2&gt;
  
  
  The biological problem - two diseases that look the same but aren't
&lt;/h2&gt;

&lt;p&gt;Crohn's disease (CD) and ulcerative colitis (UC) are both types of Inflammatory Bowel Disease (IBD). Both cause chronic inflammation in the gut, both cause pain and discomfort, and both are lifelong conditions. Doctors have known for decades that they're different diseases, but at a surface level they're easy to confuse.&lt;/p&gt;

&lt;p&gt;The key difference is &lt;strong&gt;where&lt;/strong&gt; and &lt;strong&gt;how&lt;/strong&gt; they inflame:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Crohn's disease&lt;/strong&gt; can affect any part of the digestive tract, tends to cause deep, patchy inflammation, and is driven largely by a type of immune cell called a myeloid cell (think macrophages - the "pac-man" cells of the immune system)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ulcerative colitis&lt;/strong&gt; affects only the colon and rectum, causes continuous surface inflammation, and is driven more by antibody-producing cells called plasma cells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these differences at the level of individual cells - not just tissue - is critical for developing better, more targeted treatments. That's where single-cell sequencing comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The data - 18 patients, 40,000 cells, one big challenge
&lt;/h2&gt;

&lt;p&gt;I used a publicly available dataset from the CZ CELLxGENE platform containing colonic mucosa (colon lining) biopsies from 18 donors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 patients with Crohn's disease&lt;/li&gt;
&lt;li&gt;6 patients with ulcerative colitis&lt;/li&gt;
&lt;li&gt;6 healthy controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each donor contributed thousands of cells, giving us 46,700 cells total (40,084 after quality filtering).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's the problem:&lt;/strong&gt; when you combine data from 18 different people, collected at different times, processed slightly differently in the lab - the data gets messy. The technical differences between samples (called &lt;strong&gt;batch effects&lt;/strong&gt;) can be so strong that they hide the real biological differences you actually care about.&lt;/p&gt;

&lt;p&gt;Think of it like trying to compare photos taken by 18 different cameras with different colour settings. The same object might look completely different just because of the camera, not because the object actually changed.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Harmony&lt;/strong&gt; comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Harmony and why does it matter?
&lt;/h2&gt;

&lt;p&gt;Harmony is a batch correction algorithm. In plain English: it's a mathematical tool that looks at all 18 samples, figures out which differences between samples are due to technical variation (the "camera settings"), and removes those differences - leaving only the real biological signal.&lt;/p&gt;

&lt;p&gt;Here's what the data looks like &lt;strong&gt;before&lt;/strong&gt; Harmony:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PCA plot coloured by donor - you can see individual donors clustering separately, meaning donor identity is driving the structure more than the actual biology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And &lt;strong&gt;after&lt;/strong&gt; Harmony:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;UMAP plot coloured by donor - all 18 donors are now mixed uniformly within each cell type cluster. The batch effects are gone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But crucially - when we colour the same post-Harmony UMAP by &lt;strong&gt;disease group&lt;/strong&gt; instead of donor:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;UMAP coloured by disease - Crohn's, UC, and normal cells now separate into distinct regions. The biology is preserved, the noise is removed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This before/after comparison is the core technical contribution of this project. Without Harmony, any finding could be explained away with "oh, that's just because donor 3 was processed differently." With Harmony, differences between disease groups are much harder to attribute to how an individual donor's sample was handled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting up the pipeline in Python
&lt;/h2&gt;

&lt;p&gt;The full analysis uses &lt;strong&gt;Scanpy&lt;/strong&gt; - the standard Python library for single-cell analysis - along with &lt;strong&gt;harmonypy&lt;/strong&gt; for the integration step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scanpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;harmonypy&lt;/span&gt;

&lt;span class="c1"&gt;# Load the dataset
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_h5ad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/ibd_dataset.h5ad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 46,700 cells × 32,354 genes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is stored in an &lt;code&gt;AnnData&lt;/code&gt; object - think of it as a very specialised spreadsheet where rows are cells, columns are genes, and there's extra space to store metadata like disease status, donor ID, and cell type labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 - Quality control
&lt;/h3&gt;

&lt;p&gt;Not every cell in a sequencing experiment is a real, healthy cell. Some are damaged, some are empty droplets that got accidentally captured, and some are doublets (two cells mistakenly counted as one). We filter these out using three thresholds on two per-cell metrics (the number of genes detected and the percentage of counts that come from mitochondrial genes):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remove low quality cells
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_genes_by_counts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# too few genes = empty droplet
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_genes_by_counts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# too many genes = doublet
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pct_counts_mt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;        &lt;span class="c1"&gt;# high mitochondrial % = dying cell
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why mitochondrial genes?&lt;/strong&gt; When a cell is dying or damaged, its outer membrane becomes leaky and RNA from the cytoplasm escapes - but mitochondria (the cell's energy factories) have their own separate DNA and RNA, enclosed in their own membranes, which is retained for longer. So a high percentage of mitochondrial gene reads is a widely used sign of a low-quality cell.&lt;/p&gt;

&lt;p&gt;After filtering: 40,084 cells remain.&lt;/p&gt;
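
&lt;p&gt;The filters above rely on two per-cell columns, &lt;code&gt;n_genes_by_counts&lt;/code&gt; and &lt;code&gt;pct_counts_mt&lt;/code&gt;, which Scanpy typically fills in via &lt;code&gt;sc.pp.calculate_qc_metrics&lt;/code&gt;. As a rough illustration of what those numbers mean, here is a minimal NumPy sketch on a made-up 4-cell matrix. The gene names, the toy thresholds, and the assumption that mitochondrial genes carry an &lt;code&gt;MT-&lt;/code&gt; prefix are all illustrative, not taken from the dataset.&lt;/p&gt;

```python
import numpy as np

# Toy raw count matrix: 4 cells (rows) x 5 genes (columns).
# Gene names are made up; the last two stand in for mitochondrial genes.
genes = np.array(["GENE_A", "GENE_B", "GENE_C", "MT-CO1", "MT-ND1"])
counts = np.array([
    [5, 0, 3, 1, 1],   # healthy-looking cell
    [0, 0, 1, 0, 0],   # nearly empty droplet
    [2, 1, 0, 4, 5],   # mostly mitochondrial reads
    [7, 2, 6, 0, 1],   # healthy-looking cell
])

# Flag mitochondrial genes by name prefix (an assumption about naming)
is_mt = np.char.startswith(genes, "MT-")

# Number of genes with at least one count in each cell
n_genes_by_counts = np.count_nonzero(counts, axis=1)

# Percentage of each cell's counts that come from mitochondrial genes
total_counts = counts.sum(axis=1)
pct_counts_mt = 100 * counts[:, is_mt].sum(axis=1) / total_counts

# Toy thresholds; the article uses 200, 6000 and 30 on the real data
keep = np.logical_and(
    np.greater(n_genes_by_counts, 1),
    np.less(pct_counts_mt, 30),
)
print(keep)
```

&lt;p&gt;In this toy example the near-empty droplet and the mostly mitochondrial cell are dropped, while the two healthy-looking cells are kept.&lt;/p&gt;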

&lt;h3&gt;
  
  
  Step 2 - Normalisation
&lt;/h3&gt;

&lt;p&gt;Different cells capture different amounts of RNA simply due to technical variation in the sequencing process. A cell with 10,000 RNA molecules captured will look like it expresses every gene more than a cell with only 2,000 captured - even if their true biology is identical.&lt;/p&gt;

&lt;p&gt;We normalise by scaling every cell to have the same total count (10,000), then apply a log transformation to reduce the influence of very highly expressed genes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_sum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log1p&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
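
&lt;p&gt;To make the arithmetic concrete, the sketch below reproduces the two steps by hand in NumPy on two made-up cells that have the same gene proportions but different sequencing depth. It is a simplified illustration of scaling to a target of 10,000 followed by &lt;code&gt;log1p&lt;/code&gt;, not Scanpy's actual implementation.&lt;/p&gt;

```python
import numpy as np

# Two toy cells with identical gene proportions but different depth
counts = np.array([
    [100.0, 300.0, 600.0],   # 1,000 total counts
    [ 10.0,  30.0,  60.0],   # 100 total counts
])

target_sum = 1e4

# Scale each cell (row) so its counts add up to target_sum
totals = counts.sum(axis=1, keepdims=True)
normalised = counts / totals * target_sum

# log(1 + x) compresses the range so highly expressed genes dominate less
logged = np.log1p(normalised)

print(normalised)
print(logged.round(2))
```

&lt;p&gt;After scaling, both rows are identical, which is the point: the difference in depth is removed and only the relative expression remains.&lt;/p&gt;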



&lt;h3&gt;
  
  
  Step 3 - Finding highly variable genes
&lt;/h3&gt;

&lt;p&gt;Out of 32,354 genes, most are either not expressed at all or expressed at the same level in every cell (housekeeping genes that keep basic cell functions running). These genes add noise without adding information.&lt;/p&gt;

&lt;p&gt;We select only the &lt;strong&gt;highly variable genes&lt;/strong&gt; - genes whose expression varies meaningfully between cells - for downstream analysis. We found 2,873 of these, using &lt;code&gt;batch_key='donor_id'&lt;/code&gt; so that variability is assessed within each donor separately and then combined, which favours genes that vary across many donors rather than in just one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;highly_variable_genes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;min_disp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;donor_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Result: 2,873 highly variable genes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4 - PCA and Harmony integration
&lt;/h3&gt;

&lt;p&gt;PCA (Principal Component Analysis) reduces our 2,873-gene matrix into 50 dimensions that capture the most important variation. Then Harmony corrects for batch effects within this PCA space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;svd_solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arpack&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;external&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;harmony_integrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;donor_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter_harmony&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Harmony converged in just 6 iterations
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Harmony stopped after 6 of the 20 allowed iterations, which means its correction stabilised quickly. Fast convergence on its own does not prove the correction is good, so the before/after plots coloured by donor remain the real check.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 - UMAP visualisation
&lt;/h3&gt;

&lt;p&gt;UMAP (Uniform Manifold Approximation and Projection) takes the Harmony-corrected embeddings and projects them into 2D for visualisation. Cells with similar expression profiles end up close together; distances between far-apart clusters are less reliable and should not be over-interpreted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_rep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X_pca_harmony&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;umap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What we found - the biology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Finding 1: Five major cell populations in the gut
&lt;/h3&gt;

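&lt;p&gt;The UMAP revealed five well-separated clusters corresponding to the major cell types in the colon. Note that the counts below sum to 46,700, the pre-filtering total, rather than the 40,084 cells kept after quality control:&lt;/p&gt;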

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plasma cells&lt;/strong&gt; (15,633 cells) - antibody factories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Colon epithelial cells&lt;/strong&gt; (12,347 cells) - the cells lining the gut wall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T cells&lt;/strong&gt; (12,128 cells) - immune soldiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Myeloid cells&lt;/strong&gt; (3,771 cells) - macrophages and related immune cells&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stromal cells&lt;/strong&gt; (2,821 cells) - structural support cells&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Finding 2: UC and CD have completely different cellular makeups
&lt;/h3&gt;

&lt;p&gt;This is where the biology gets interesting. When we look at what proportion of cells belong to each type across disease groups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plasma cells:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: ~27%&lt;/li&gt;
&lt;li&gt;Crohn's disease: ~30%&lt;/li&gt;
&lt;li&gt;Ulcerative colitis: &lt;strong&gt;~52%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;UC has nearly double the plasma cell proportion of healthy tissue. Plasma cells make antibodies, so this is consistent with the well-established view that antibody-mediated (humoral) immunity plays a prominent role in UC - a pattern we independently reproduced from raw data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Myeloid cells:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: ~6%&lt;/li&gt;
&lt;li&gt;Ulcerative colitis: ~8%&lt;/li&gt;
&lt;li&gt;Crohn's disease: &lt;strong&gt;~13%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CD has more than double the myeloid cell proportion of healthy tissue. Myeloid cells include macrophages - the cells responsible for the granulomatous (nodule-forming) inflammation that's the hallmark of Crohn's disease.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Epithelial cells:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: ~29%&lt;/li&gt;
&lt;li&gt;Crohn's disease: ~19%&lt;/li&gt;
&lt;li&gt;Ulcerative colitis: &lt;strong&gt;~3%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

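&lt;p&gt;This is striking. Epithelial cells make up only a few percent of the cells recovered from UC biopsies, which is consistent with a severely disrupted gut lining and with the pronounced mucosal damage and bleeding that UC causes. Because these are proportions, part of the drop also reflects the expansion of plasma cells in the same samples.&lt;/p&gt;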
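
&lt;p&gt;Proportions like these can be computed from the per-cell metadata table. Here is a minimal sketch, assuming the metadata has columns named &lt;code&gt;disease&lt;/code&gt; and &lt;code&gt;cell_type&lt;/code&gt; (hypothetical names; the real dataset's columns may differ) and using a few toy rows in place of &lt;code&gt;adata.obs&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for adata.obs; column names and values are assumptions
obs = pd.DataFrame({
    "disease": ["normal", "normal", "normal", "normal",
                "UC", "UC", "UC", "UC"],
    "cell_type": ["plasma", "epithelial", "T cell", "epithelial",
                  "plasma", "plasma", "T cell", "plasma"],
})

# Rows: disease group; columns: cell type.
# normalize="index" turns each row into fractions of that group's cells.
proportions = pd.crosstab(
    obs["disease"], obs["cell_type"], normalize="index"
) * 100

print(proportions.round(1))
```

&lt;p&gt;Each row sums to 100, so a rise in one cell type necessarily pushes the others down. Computing the same table per donor rather than per disease group is a useful follow-up check that a shift is not driven by one or two samples.&lt;/p&gt;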

&lt;h3&gt;
  
  
  Finding 3: The S100A8/S100A9 signature - a clinical biomarker reproduced at single-cell resolution
&lt;/h3&gt;

&lt;p&gt;To find which specific genes are driving the myeloid difference in Crohn's disease, we ran a differential expression analysis comparing CD myeloid cells against normal myeloid cells using the Wilcoxon rank-sum test.&lt;/p&gt;
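
&lt;p&gt;The idea behind the test can be illustrated on a single gene. The sketch below uses &lt;code&gt;scipy.stats.ranksums&lt;/code&gt; on made-up expression values. In Scanpy the comparison is usually run across all genes at once with &lt;code&gt;sc.tl.rank_genes_groups(..., method='wilcoxon')&lt;/code&gt;, though whether this project used exactly that call is an assumption here.&lt;/p&gt;

```python
from scipy.stats import ranksums

# Made-up log-normalised expression of one gene in myeloid cells
cd_cells = [5.1, 4.8, 6.0, 5.5, 4.9, 5.7]       # Crohn's disease
normal_cells = [1.0, 1.2, 0.8, 1.5, 0.9, 1.1]   # healthy controls

# Wilcoxon rank-sum test: do values in one group tend to rank
# higher than values in the other?
statistic, p_value = ranksums(cd_cells, normal_cells)

# A positive statistic means the first group (CD) tends to be higher
print(f"statistic={statistic:.2f}, p={p_value:.4f}")
```

&lt;p&gt;The test only looks at ranks, not raw values, which makes it robust to the skewed, zero-heavy distributions typical of single-cell data.&lt;/p&gt;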

&lt;p&gt;The top upregulated genes were:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gene&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S100A8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subunit of calprotectin - a protein released by activated immune cells&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S100A9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The other subunit of calprotectin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CXCL8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Also known as IL-8 - a chemical signal that recruits more immune cells to the inflammation site&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IL1RN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An anti-inflammatory signal - the body trying to dampen its own response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BCL2A1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps immune cells alive longer in the inflamed environment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's why S100A8 and S100A9 matter: together they form a protein called &lt;strong&gt;calprotectin&lt;/strong&gt;. When gut inflammation is active, immune cells (mainly neutrophils) release calprotectin into the gut, and it ends up in the stool. Doctors measure this with a routine test called the &lt;strong&gt;faecal calprotectin test&lt;/strong&gt; - one of the most common non-invasive ways to monitor IBD disease activity in clinic.&lt;/p&gt;

&lt;p&gt;By analysing raw single-cell data, we independently recovered the two genes behind this clinical test - at the resolution of individual cells. Recovering a known biomarker is the kind of sanity check that suggests the analysis is biologically meaningful, not just a statistical artefact.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I learned - technical takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Batch correction is essential for multi-sample studies.&lt;/strong&gt; Without it, donor-to-donor variation can swamp the disease signal, and any "finding" could be explained by technical noise. Harmony is one of the most widely used tools for this step in multi-patient single-cell studies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Choosing highly variable genes with &lt;code&gt;batch_key&lt;/code&gt; matters.&lt;/strong&gt; If you find HVGs without accounting for batches, you risk selecting genes that are variable only because of technical differences between samples. Using &lt;code&gt;batch_key='donor_id'&lt;/code&gt; reduces that risk by assessing variability within each donor before combining the results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Clinical relevance is the best validation.&lt;/strong&gt; When your computational analysis independently reproduces a well-established clinical biomarker (faecal calprotectin), it gives you confidence that the pipeline is working correctly and the findings are real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Public datasets are powerful.&lt;/strong&gt; This entire analysis used freely available data from CZ CELLxGENE - no lab access required. The tools (Scanpy, harmonypy) are free and open source. Computational biology has an exceptionally low barrier to entry compared to wet lab science.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full code
&lt;/h2&gt;

&lt;p&gt;The complete annotated notebook is available on GitHub:&lt;br&gt;
👉 &lt;a href="https://github.com/Farhan89082/ibd-harmony-integration" rel="noopener noreferrer"&gt;github.com/Farhan89082/ibd-harmony-integration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README includes all figures, biological interpretation, and instructions for reproducing the analysis from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;This project is part of a series of three single-cell RNA sequencing analyses I've built for my computational biology portfolio:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alzheimer's Disease&lt;/strong&gt; - microglial activation and mitochondrial dysfunction in human brain cells → &lt;a href="https://github.com/Farhan89082/alzheimers-scrna-analysis" rel="noopener noreferrer"&gt;github.com/Farhan89082/alzheimers-scrna-analysis&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NSCLC Tumour Microenvironment&lt;/strong&gt; - T cell exhaustion trajectories and macrophage polarisation in lung cancer → &lt;a href="https://github.com/Farhan89082/nsclc-tumour-microenvironment" rel="noopener noreferrer"&gt;github.com/Farhan89082/nsclc-tumour-microenvironment&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBD Harmony Integration&lt;/strong&gt; - this article&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're a Python developer curious about getting into computational biology, or a biology student learning to code, I hope this walkthrough shows that the barrier is lower than it looks. The tools are excellent, the data is freely available, and the biology is genuinely fascinating.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions about the analysis or want to discuss the methodology? Drop a comment below - I'd love to hear from you.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>bioinformatics</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How I Used Python to Analyse 40,000 Human Gut Cells and Uncover What Makes Crohn's Disease Different from Colitis</title>
      <dc:creator>Farhan Rehman Sherief</dc:creator>
      <pubDate>Sun, 07 Jun 2026 10:52:39 +0000</pubDate>
      <link>https://dev.to/farhansherief/how-i-used-python-to-analyse-40000-human-gut-cells-and-uncover-what-makes-crohns-disease-47h0</link>
      <guid>https://dev.to/farhansherief/how-i-used-python-to-analyse-40000-human-gut-cells-and-uncover-what-makes-crohns-disease-47h0</guid>
      <description>&lt;p&gt;&lt;em&gt;A step-by-step walkthrough of my multi-sample single-cell RNA sequencing project - written for anyone curious about how computational biology actually works&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Before we start - what is this article actually about?
&lt;/h2&gt;

&lt;p&gt;Imagine being able to take a tiny sample of tissue from a patient's gut, and instead of just knowing "there are cells here", you could read the activity of every single gene inside every single individual cell - thousands of cells at once.&lt;/p&gt;

&lt;p&gt;That's what single-cell RNA sequencing (scRNA-seq) does. In this article, I'll walk you through how I used Python to analyse data from roughly 40,000 human gut cells across 18 donors to understand what makes two similar gut diseases - Crohn's disease and ulcerative colitis - biologically different from each other.&lt;/p&gt;

&lt;p&gt;No biology PhD required. I'll explain every term as we go.&lt;/p&gt;




&lt;h2&gt;
  
  
  The biological problem - two diseases that look the same but aren't
&lt;/h2&gt;

&lt;p&gt;Crohn's disease (CD) and ulcerative colitis (UC) are both types of Inflammatory Bowel Disease (IBD). Both cause chronic inflammation in the gut, both cause pain and discomfort, and both are lifelong conditions. Doctors have known for decades that they're different diseases, but at a surface level they're easy to confuse.&lt;/p&gt;

&lt;p&gt;The key difference is &lt;strong&gt;where&lt;/strong&gt; and &lt;strong&gt;how&lt;/strong&gt; they inflame:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Crohn's disease&lt;/strong&gt; can affect any part of the digestive tract, tends to cause deep, patchy inflammation, and is driven largely by a type of immune cell called a myeloid cell (think macrophages - the "pac-man" cells of the immune system)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ulcerative colitis&lt;/strong&gt; affects only the colon and rectum, causes continuous surface inflammation, and is driven more by antibody-producing cells called plasma cells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these differences at the level of individual cells - not just tissue - is critical for developing better, more targeted treatments. That's where single-cell sequencing comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  The data - 18 patients, 40,000 cells, one big challenge
&lt;/h2&gt;

&lt;p&gt;I used a publicly available dataset from the CZ CELLxGENE platform containing colonic mucosa (colon lining) biopsies from 18 donors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;6 patients with Crohn's disease&lt;/li&gt;
&lt;li&gt;6 patients with ulcerative colitis&lt;/li&gt;
&lt;li&gt;6 healthy controls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each donor contributed thousands of cells, giving us 46,700 cells total (40,084 after quality filtering).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But here's the problem:&lt;/strong&gt; when you combine data from 18 different people, collected at different times, processed slightly differently in the lab - the data gets messy. The technical differences between samples (called &lt;strong&gt;batch effects&lt;/strong&gt;) can be so strong that they hide the real biological differences you actually care about.&lt;/p&gt;

&lt;p&gt;Think of it like trying to compare photos taken by 18 different cameras with different colour settings. The same object might look completely different just because of the camera, not because the object actually changed.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Harmony&lt;/strong&gt; comes in.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is Harmony and why does it matter?
&lt;/h2&gt;

&lt;p&gt;Harmony is a batch correction algorithm. In plain English: it's a mathematical tool that looks at all 18 samples, figures out which differences between samples are due to technical variation (the "camera settings"), and removes those differences - leaving only the real biological signal.&lt;/p&gt;

&lt;p&gt;Here's what the data looks like &lt;strong&gt;before&lt;/strong&gt; Harmony:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PCA plot coloured by donor - you can see individual donors clustering separately, meaning donor identity is driving the structure more than the actual biology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And &lt;strong&gt;after&lt;/strong&gt; Harmony:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;UMAP plot coloured by donor - all 18 donors are now mixed uniformly within each cell type cluster. The batch effects are gone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But crucially - when we colour the same post-Harmony UMAP by &lt;strong&gt;disease group&lt;/strong&gt; instead of donor:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;UMAP coloured by disease - Crohn's, UC, and normal cells now separate into distinct regions. The biology is preserved, the noise is removed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This before/after comparison is the core technical contribution of this project. Without Harmony, any finding could be explained away with "oh, that's just because donor 3 was processed differently." With Harmony, differences between disease groups are much harder to attribute to how an individual donor's sample was handled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting up the pipeline in Python
&lt;/h2&gt;

&lt;p&gt;The full analysis uses &lt;strong&gt;Scanpy&lt;/strong&gt; - the standard Python library for single-cell analysis - along with &lt;strong&gt;harmonypy&lt;/strong&gt; for the integration step.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scanpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;harmonypy&lt;/span&gt;

&lt;span class="c1"&gt;# Load the dataset
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_h5ad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data/ibd_dataset.h5ad&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# 46,700 cells × 32,354 genes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The data is stored in an &lt;code&gt;AnnData&lt;/code&gt; object - think of it as a very specialised spreadsheet where rows are cells, columns are genes, and there's extra space to store metadata like disease status, donor ID, and cell type labels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 - Quality control
&lt;/h3&gt;

&lt;p&gt;Not every cell in a sequencing experiment is a real, healthy cell. Some are damaged, some are empty droplets that got accidentally captured, and some are doublets (two cells mistakenly counted as one). We filter these out using three thresholds on two per-cell metrics, the number of detected genes and the percentage of mitochondrial reads:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
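&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Compute QC metrics first (assumes human gene symbols, so mitochondrial genes start with 'MT-')
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;var&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;var_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;MT-&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;calculate_qc_metrics&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qc_vars&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;mt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;inplace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Remove low quality cells
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_genes_by_counts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# too few genes = empty droplet
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;n_genes_by_counts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;6000&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# too many genes = doublet
&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;obs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pct_counts_mt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;copy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# high mitochondrial % = dying cell; copy() makes a real object, not a view
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;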

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why mitochondrial genes?&lt;/strong&gt; When a cell is dying or its membrane is damaged, RNA from the cytoplasm leaks out, but transcripts inside the mitochondria (the cell's energy factories, which carry their own small genome) are better protected and tend to be retained. The mitochondrial share of the remaining reads therefore goes up, so a high percentage of mitochondrial reads is a widely used sign of a low-quality cell.&lt;/p&gt;
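&lt;p&gt;To make the mitochondrial metric concrete, here is a minimal sketch of how the percentage is derived for a single cell. The gene names and counts are toy values, &lt;code&gt;pct_mito&lt;/code&gt; is a hypothetical helper, and the code assumes human gene symbols where mitochondrial genes start with &lt;code&gt;MT-&lt;/code&gt;.&lt;/p&gt;

```python
# Toy UMI counts for a single cell; gene names and numbers are illustrative.
cell_counts = {"MT-CO1": 120, "MT-ND1": 80, "ACTB": 300, "EPCAM": 500}


def pct_mito(counts):
    """Percentage of a cell's counts that come from mitochondrial genes.

    Assumes human gene symbols, where mitochondrial genes start with 'MT-'.
    """
    total = sum(counts.values())
    mito = sum(n for gene, n in counts.items() if gene.startswith("MT-"))
    return 100 * mito / total


print(pct_mito(cell_counts))  # 20.0
```

&lt;p&gt;With the 30% cutoff used above, this toy cell would be kept.&lt;/p&gt;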

&lt;p&gt;After filtering: 40,084 cells remain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 - Normalisation
&lt;/h3&gt;

&lt;p&gt;Different cells capture different amounts of RNA simply due to technical variation in the sequencing process. A cell with 10,000 RNA molecules captured will look like it expresses every gene more than a cell with only 2,000 captured — even if their true biology is identical.&lt;/p&gt;

&lt;p&gt;We normalise by scaling every cell to have the same total count (10,000), then apply a log transformation to reduce the influence of very highly expressed genes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;normalize_total&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_sum&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log1p&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
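&lt;p&gt;The arithmetic behind those two calls can be sketched in plain Python. This is an illustration with toy counts rather than Scanpy's implementation, and &lt;code&gt;normalise_and_log&lt;/code&gt; is a hypothetical helper: two cells with the same composition but different capture depth end up with identical values.&lt;/p&gt;

```python
import math


def normalise_and_log(counts, target_sum=10_000):
    """Scale a cell's counts to target_sum, then apply log(1 + x)."""
    total = sum(counts.values())
    return {gene: math.log1p(n * target_sum / total) for gene, n in counts.items()}


# Two toy cells with the same composition but different capture depth
deep_cell = {"GENE_A": 5000, "GENE_B": 5000}      # 10,000 molecules captured
shallow_cell = {"GENE_A": 1000, "GENE_B": 1000}   # 2,000 molecules captured

print(normalise_and_log(deep_cell))
print(normalise_and_log(shallow_cell))
```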



&lt;h3&gt;
  
  
  Step 3 - Finding highly variable genes
&lt;/h3&gt;

&lt;p&gt;Out of 32,354 genes, most are either not expressed at all or expressed at the same level in every cell (housekeeping genes that keep basic cell functions running). These genes add noise without adding information.&lt;/p&gt;

&lt;p&gt;We select only the &lt;strong&gt;highly variable genes&lt;/strong&gt; - genes whose expression varies meaningfully between cells - for downstream analysis. We found 2,873 of these. Passing &lt;code&gt;batch_key='donor_id'&lt;/code&gt; makes Scanpy select variable genes within each donor separately and then combine the results, which favours genes that are variable in many donors rather than genes that merely differ between donors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;highly_variable_genes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.0125&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                             &lt;span class="n"&gt;min_disp&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;donor_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Result: 2,873 highly variable genes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
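&lt;p&gt;The idea of "variable" can be illustrated with a variance-to-mean ratio (dispersion) on toy data. This is a deliberately simplified sketch: the gene names and values are made up, &lt;code&gt;dispersion&lt;/code&gt; is a hypothetical helper, and Scanpy's default method additionally bins genes by mean expression and normalises dispersion within each bin.&lt;/p&gt;

```python
from statistics import mean, pvariance

# Toy expression values for two genes across six cells (made-up numbers)
expression = {
    "HOUSEKEEPING": [5.0, 5.1, 4.9, 5.0, 5.1, 4.9],   # nearly constant
    "MARKER": [0.0, 0.0, 9.0, 0.1, 8.5, 0.0],          # on in some cells only
}


def dispersion(values):
    """Variance-to-mean ratio: a crude measure of how variable a gene is."""
    return pvariance(values) / mean(values)


for gene, values in expression.items():
    print(gene, round(dispersion(values), 3))
```

&lt;p&gt;The marker-like gene scores far higher than the housekeeping-like gene, which is the property the selection step is looking for.&lt;/p&gt;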



&lt;h3&gt;
  
  
  Step 4 - PCA and Harmony integration
&lt;/h3&gt;

&lt;p&gt;PCA (Principal Component Analysis) reduces our 2,873-gene matrix into 50 dimensions that capture the most important variation. Then Harmony corrects for batch effects within this PCA space:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pca&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;svd_solver&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arpack&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;external&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;harmony_integrate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;donor_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_iter_harmony&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Harmony converged in just 6 iterations
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Harmony converging after only 6 of the 20 allowed iterations means the correction stabilised quickly. That is reassuring, but it is not proof of quality on its own: the donor- and disease-coloured plots above are the real check that batch effects were removed without erasing biological signal.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5 - UMAP visualisation
&lt;/h3&gt;

&lt;p&gt;UMAP (Uniform Manifold Approximation and Projection) takes the Harmony-corrected embeddings and projects them into 2D for visualisation. Similar cells end up close together; distances between far-apart clusters are less reliable and shouldn't be over-interpreted:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;neighbors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;use_rep&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;X_pca_harmony&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_neighbors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;umap&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;adata&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  What we found - the biology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Finding 1: Five major cell populations in the gut
&lt;/h3&gt;

&lt;p&gt;The UMAP revealed five well-separated clusters corresponding to the major cell types in the colon. Note that the counts below add up to 46,700, the size of the dataset before QC filtering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Plasma cells&lt;/strong&gt; (15,633 cells) - antibody factories&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Colon epithelial cells&lt;/strong&gt; (12,347 cells) - the cells lining the gut wall&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;T cells&lt;/strong&gt; (12,128 cells) - immune soldiers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Myeloid cells&lt;/strong&gt; (3,771 cells) - macrophages and related immune cells&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stromal cells&lt;/strong&gt; (2,821 cells) - structural support cells&lt;/li&gt;
&lt;/ul&gt;
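&lt;p&gt;As a quick sanity check, the counts in the list above can be turned into shares of the whole dataset. The numbers are taken directly from the list; the variable names are only illustrative.&lt;/p&gt;

```python
# Cell counts per population, as reported in the list above
cell_counts = {
    "Plasma cells": 15_633,
    "Colon epithelial cells": 12_347,
    "T cells": 12_128,
    "Myeloid cells": 3_771,
    "Stromal cells": 2_821,
}

total = sum(cell_counts.values())
shares = {name: 100 * n / total for name, n in cell_counts.items()}

print(total)
for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```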

&lt;h3&gt;
  
  
  Finding 2: UC and CD have markedly different cellular makeups
&lt;/h3&gt;

&lt;p&gt;This is where the biology gets interesting. When we look at what proportion of cells belong to each type across disease groups:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Plasma cells:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: ~27%&lt;/li&gt;
&lt;li&gt;Crohn's disease: ~30%&lt;/li&gt;
&lt;li&gt;Ulcerative colitis: &lt;strong&gt;~52%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;UC has nearly double the plasma cell proportion of healthy tissue. Plasma cells make antibodies, so this is consistent with a strong antibody-mediated (humoral) component in UC - a well-established finding that we independently reproduced from raw data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Myeloid cells:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: ~6%&lt;/li&gt;
&lt;li&gt;Ulcerative colitis: ~8%&lt;/li&gt;
&lt;li&gt;Crohn's disease: &lt;strong&gt;~13%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CD has more than double the myeloid cell proportion of normal tissue (~13% vs ~6%). Myeloid cells include macrophages - the cells responsible for the granulomatous (nodule-forming) inflammation that's the hallmark of Crohn's disease.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Epithelial cells:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Normal: ~29%&lt;/li&gt;
&lt;li&gt;Crohn's disease: ~19%&lt;/li&gt;
&lt;li&gt;Ulcerative colitis: &lt;strong&gt;~3%&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

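&lt;p&gt;This is striking. In the UC samples, epithelial cells make up only ~3% of captured cells, compared with ~29% in normal tissue. Cell proportions in scRNA-seq also depend on how tissue is sampled and dissociated, but the drop is consistent with the severely disrupted gut lining, mucosal damage and bleeding seen in UC.&lt;/p&gt;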
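&lt;p&gt;The "nearly double" and "more than double" statements follow from dividing each disease proportion by the normal one. A small sketch using the approximate percentages quoted above (&lt;code&gt;fold_vs_normal&lt;/code&gt; is a hypothetical helper, not part of the notebook):&lt;/p&gt;

```python
# Approximate percentages quoted above, per disease group
proportions = {
    "Plasma cells": {"Normal": 27, "CD": 30, "UC": 52},
    "Myeloid cells": {"Normal": 6, "CD": 13, "UC": 8},
    "Epithelial cells": {"Normal": 29, "CD": 19, "UC": 3},
}


def fold_vs_normal(groups):
    """Ratio of each disease group's proportion to the normal proportion."""
    return {g: groups[g] / groups["Normal"] for g in ("CD", "UC")}


for cell_type, groups in proportions.items():
    folds = fold_vs_normal(groups)
    print(cell_type, {g: round(f, 2) for g, f in folds.items()})
```

&lt;p&gt;Because the inputs are rounded percentages, the resulting ratios are rough guides rather than precise effect sizes.&lt;/p&gt;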

&lt;h3&gt;
  
  
  Finding 3: The S100A8/S100A9 signature - a clinical biomarker reproduced at single-cell resolution
&lt;/h3&gt;

&lt;p&gt;To find which specific genes are driving the myeloid difference in Crohn's disease, we ran a differential expression analysis comparing CD myeloid cells against normal myeloid cells using the Wilcoxon rank-sum test.&lt;/p&gt;

&lt;p&gt;The top upregulated genes were:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Gene&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S100A8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Subunit of calprotectin - a protein released by activated immune cells&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S100A9&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The other subunit of calprotectin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CXCL8&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Also known as IL-8 - a chemical signal that recruits more immune cells to the inflammation site&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;IL1RN&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;An anti-inflammatory signal - the body trying to dampen its own response&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BCL2A1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps immune cells alive longer in the inflamed environment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here's why S100A8 and S100A9 matter: together they form a protein called &lt;strong&gt;calprotectin&lt;/strong&gt;. When gut inflammation is active, immune cells release calprotectin into the stool. Doctors measure this in a routine test called the &lt;strong&gt;faecal calprotectin test&lt;/strong&gt; - one of the most common non-invasive ways to monitor IBD disease activity in clinic.&lt;/p&gt;

&lt;p&gt;By analysing raw single-cell data, we independently identified the exact genes behind this clinical test - at the resolution of individual cells. That's the kind of validation that suggests the analysis is biologically meaningful, not just a statistical artefact.&lt;/p&gt;
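&lt;p&gt;For readers curious about the Wilcoxon rank-sum test used for the differential expression step, here is a minimal sketch of the statistic for a single gene. In practice Scanpy's &lt;code&gt;sc.tl.rank_genes_groups&lt;/code&gt; with &lt;code&gt;method='wilcoxon'&lt;/code&gt; handles this across all genes; the exact call used in the notebook is not shown in this article, and &lt;code&gt;rank_sum_u&lt;/code&gt; and the expression values below are hypothetical.&lt;/p&gt;

```python
def rank_sum_u(group_a, group_b):
    """Mann-Whitney U statistic for group_a (equivalent to the Wilcoxon rank-sum test).

    Tied values receive the average of the ranks they span.
    """
    pooled = sorted(group_a + group_b)
    # Map each distinct value to its average 1-based rank
    ranks = {}
    i = 0
    while i != len(pooled):
        j = i
        while j != len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2   # mean of ranks i+1 .. j
        i = j
    rank_sum_a = sum(ranks[v] for v in group_a)
    n_a = len(group_a)
    return rank_sum_a - n_a * (n_a + 1) / 2


# Toy log-normalised expression of one gene in two groups of cells (made-up values)
cd_myeloid = [2.1, 3.4, 2.8, 3.9, 3.0]
normal_myeloid = [0.0, 0.5, 1.2, 0.0, 0.8]

u = rank_sum_u(cd_myeloid, normal_myeloid)
print(u)  # 25.0: every CD value outranks every normal value
```

&lt;p&gt;U ranges from 0 to the product of the two group sizes (25 here), so a value at the upper end indicates consistently higher expression in the first group. Because the test works on ranks, it makes no assumption that expression values are normally distributed.&lt;/p&gt;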




&lt;h2&gt;
  
  
  What I learned - technical takeaways
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Batch correction is non-negotiable for multi-sample studies.&lt;/strong&gt; Without Harmony, the donor-to-donor variation could swamp the disease signal. Any "finding" could be explained by technical noise. Batch correction with Harmony or a comparable method is now a standard step in multi-patient single-cell studies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Choosing highly variable genes with &lt;code&gt;batch_key&lt;/code&gt; matters.&lt;/strong&gt; If you find HVGs without accounting for batches, you risk selecting genes that are variable only because of technical differences between samples. Using &lt;code&gt;batch_key='donor_id'&lt;/code&gt; ensures you're finding genes that are genuinely biologically variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Clinical relevance is the best validation.&lt;/strong&gt; When your computational analysis independently reproduces a well-established clinical biomarker (faecal calprotectin), it gives you confidence that the pipeline is working correctly and the findings are real.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Public datasets are powerful.&lt;/strong&gt; This entire analysis used freely available data from CZ CELLxGENE - no lab access required. The tools (Scanpy, harmonypy) are free and open source. Computational biology has an exceptionally low barrier to entry compared to wet lab science.&lt;/p&gt;




&lt;h2&gt;
  
  
  The full code
&lt;/h2&gt;

&lt;p&gt;The complete annotated notebook is available on GitHub:&lt;br&gt;
👉 &lt;a href="https://github.com/Farhan89082/ibd-harmony-integration" rel="noopener noreferrer"&gt;github.com/Farhan89082/ibd-harmony-integration&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README includes all figures, biological interpretation, and instructions for reproducing the analysis from scratch.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;This project is part of a series of three single-cell RNA sequencing analyses I've built for my computational biology portfolio:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Alzheimer's Disease&lt;/strong&gt; - microglial activation and mitochondrial dysfunction in human brain cells → &lt;a href="https://github.com/Farhan89082/alzheimers-scrna-analysis" rel="noopener noreferrer"&gt;github.com/Farhan89082/alzheimers-scrna-analysis&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NSCLC Tumour Microenvironment&lt;/strong&gt; - T cell exhaustion trajectories and macrophage polarisation in lung cancer → &lt;a href="https://github.com/Farhan89082/nsclc-tumour-microenvironment" rel="noopener noreferrer"&gt;github.com/Farhan89082/nsclc-tumour-microenvironment&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IBD Harmony Integration&lt;/strong&gt; - this article&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're a Python developer curious about getting into computational biology, or a biology student learning to code, I hope this walkthrough shows that the barrier is lower than it looks. The tools are excellent, the data is freely available, and the biology is genuinely fascinating.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Have questions about the analysis or want to discuss the methodology? Drop a comment below - I'd love to hear from you.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>python</category>
      <category>bioinformatics</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>T Cells, Tumour Macrophages, and Why Lung Cancer Evades Your Immune System</title>
      <dc:creator>Farhan Rehman Sherief</dc:creator>
      <pubDate>Sat, 06 Jun 2026 18:54:14 +0000</pubDate>
      <link>https://dev.to/farhansherief/t-cells-tumour-macrophages-and-why-lung-cancer-evades-your-immune-system-1mf7</link>
      <guid>https://dev.to/farhansherief/t-cells-tumour-macrophages-and-why-lung-cancer-evades-your-immune-system-1mf7</guid>
      <description>&lt;p&gt;One of the most frustrating puzzles in cancer biology: some lung cancer patients respond brilliantly to immunotherapy. Others don't respond at all. The tumour microenvironment (TME), the ecosystem of immune, stromal, and cancer cells that surrounds a tumour, is a big part of why.&lt;/p&gt;

&lt;p&gt;I wanted to understand what that ecosystem actually looks like at the resolution of individual cells. So I built a single-cell RNA sequencing (scRNA-seq) analysis of non-small cell lung cancer (NSCLC) tissue using Scanpy, working with a publicly available dataset of 111,683 cells from CZ CELLxGENE, spanning 38 distinct cell states.&lt;/p&gt;

&lt;p&gt;Here's what the data revealed about T cell exhaustion, macrophage behaviour, and what makes some tumours more immune-suppressive than others.&lt;/p&gt;




&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;NSCLC accounts for approximately 85% of all lung cancer cases. Immune checkpoint inhibitors have transformed treatment for some patients — but a large proportion still don't respond. The leading explanation is that the TME is actively suppressing anti-tumour immunity, rather than the immune system simply being absent.&lt;/p&gt;

&lt;p&gt;To understand this, you need to look not just at whether immune cells are present, but at &lt;strong&gt;what functional state those cells are in&lt;/strong&gt;. That's exactly what single-cell RNA sequencing makes possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dataset
&lt;/h2&gt;

&lt;p&gt;The data comes from a study characterising the cellular and molecular identities of histologic subtypes in lung adenocarcinoma:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;111,683 cells&lt;/strong&gt; after QC (117,266 raw)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;57,398 genes&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;38 distinct cell states&lt;/strong&gt; across the TME&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;4 histologic subtypes:&lt;/strong&gt; Acinar/Papillary (A/P), A/P + Solid, Micropapillary (MP), and Solid&lt;/li&gt;
&lt;li&gt;Sequenced with 10× Chromium 3' v3&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Quality Control
&lt;/h3&gt;

&lt;p&gt;Cells were filtered to retain those with 200–8,000 detected genes, fewer than 100,000 UMI counts, and less than 25% mitochondrial content. This removed 5,583 low-quality cells (117,266 raw → 111,683 retained).&lt;/p&gt;
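&lt;p&gt;The filter described above can be sketched as a simple predicate. The thresholds come from the text; the cell records and the &lt;code&gt;passes_qc&lt;/code&gt; helper are hypothetical, and treating the 200 and 8,000 gene bounds as inclusive is an assumption.&lt;/p&gt;

```python
# Thresholds quoted above; the cell records below are made-up examples
MIN_GENES, MAX_GENES = 200, 8_000
MAX_UMI = 100_000
MAX_PCT_MITO = 25


def passes_qc(cell):
    """True if a cell satisfies all of the QC thresholds described in the text."""
    return (
        cell["n_genes"] >= MIN_GENES
        and MAX_GENES >= cell["n_genes"]
        and MAX_UMI > cell["n_umi"]
        and MAX_PCT_MITO > cell["pct_mito"]
    )


cells = [
    {"id": "cell_1", "n_genes": 2_500, "n_umi": 9_000, "pct_mito": 4.2},
    {"id": "cell_2", "n_genes": 150, "n_umi": 400, "pct_mito": 3.0},       # too few genes
    {"id": "cell_3", "n_genes": 3_000, "n_umi": 12_000, "pct_mito": 41.0},  # high mitochondrial %
]

kept = [c["id"] for c in cells if passes_qc(c)]
print(kept)  # ['cell_1']

# Sanity check on the reported totals
print(117_266 - 111_683)  # 5583
```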

&lt;h3&gt;
  
  
  2. Normalisation
&lt;/h3&gt;

&lt;p&gt;Standard scRNA-seq approach: normalise to 10,000 counts per cell, log1p transform, identify 2,544 highly variable genes.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Dimensionality Reduction
&lt;/h3&gt;

&lt;p&gt;PCA (50 components) → neighbourhood graph → UMAP. The resulting embedding resolved all 38 cell states, with clear separation between tumour epithelial cells, immune populations, and stromal cells.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. T Cell Exhaustion Is Not an End State - It's a Trajectory
&lt;/h3&gt;

&lt;p&gt;This was the most biologically interesting result. The CD8+ T cell data revealed three functionally distinct states forming a clear exhaustion trajectory:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T.CD8.Naive → T.CD8.Predysfunc (8,897 cells) → T.CD8.Exhausted (1,454 cells)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The pre-dysfunctional population is the &lt;em&gt;largest&lt;/em&gt; CD8 group - meaning a large share of CD8+ T cells in these tumours are being pushed toward dysfunction: not yet exhausted, but already losing their killing capacity.&lt;/p&gt;

&lt;p&gt;The fully exhausted population co-expresses multiple immune checkpoint molecules simultaneously: &lt;code&gt;CTLA4&lt;/code&gt;, &lt;code&gt;LAG3&lt;/code&gt;, &lt;code&gt;TIGIT&lt;/code&gt;, &lt;code&gt;TOX&lt;/code&gt;, &lt;code&gt;ENTPD1&lt;/code&gt;, and &lt;code&gt;CXCL13&lt;/code&gt;. In contrast, cytotoxic T cells express effector molecules (&lt;code&gt;GZMB&lt;/code&gt;, &lt;code&gt;PRF1&lt;/code&gt;, &lt;code&gt;IFNG&lt;/code&gt;) with minimal checkpoint expression. The dotplot makes this contrast striking and clear.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Regulatory T Cells Outnumber Cytotoxic T Cells ~4:1
&lt;/h3&gt;

&lt;p&gt;Tregs (6,496 cells) vastly outnumber cytotoxic CD8+ T cells (1,688 cells). Tregs express high levels of &lt;code&gt;CTLA4&lt;/code&gt; and &lt;code&gt;TOX&lt;/code&gt;, consistent with their role in actively suppressing anti-tumour immunity. This ratio alone explains a great deal about why these tumours resist immune attack.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Pro-Tumour Macrophages Dominate the Myeloid Compartment
&lt;/h3&gt;

&lt;p&gt;Five macrophage subpopulations were identified, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mac.SPP1&lt;/strong&gt; and &lt;strong&gt;Mac.SPP1.GPNMB&lt;/strong&gt; (~7,054 cells combined) - pro-tumourigenic, SPP1/osteopontin-expressing macrophages associated with poor prognosis in multiple cancer types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac.SELENOP&lt;/strong&gt; (3,082 cells) - anti-inflammatory, tissue-resident&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mac.CXCL9&lt;/strong&gt; (1,118 cells) - anti-tumour, recruits cytotoxic T cells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pro-tumour to anti-tumour macrophage ratio is approximately &lt;strong&gt;6:1&lt;/strong&gt;. Combined with the T cell exhaustion data, this paints a picture of a TME that is actively - and efficiently - suppressing immune responses at multiple levels simultaneously.&lt;/p&gt;
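&lt;p&gt;Both headline ratios in this section follow directly from the cell counts quoted above. A short check, with illustrative variable names:&lt;/p&gt;

```python
# Cell counts quoted in the findings above
counts = {
    "Treg": 6_496,
    "CD8 cytotoxic": 1_688,
    "Mac.SPP1 + Mac.SPP1.GPNMB": 7_054,
    "Mac.CXCL9": 1_118,
}

treg_to_cytotoxic = counts["Treg"] / counts["CD8 cytotoxic"]
pro_to_anti_macrophage = counts["Mac.SPP1 + Mac.SPP1.GPNMB"] / counts["Mac.CXCL9"]

print(f"Treg : cytotoxic CD8 = {treg_to_cytotoxic:.1f} : 1")
print(f"Pro- : anti-tumour macrophages = {pro_to_anti_macrophage:.1f} : 1")
```

&lt;p&gt;The macrophage ratio here counts only Mac.CXCL9 as anti-tumour, matching the labels in the list above.&lt;/p&gt;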

&lt;h3&gt;
  
  
  4. The Paradox of Solid Tumours
&lt;/h3&gt;

&lt;p&gt;Solid histologic subtype tumours had the highest immune infiltration (~75% immune cells vs ~63% in acinar/papillary). But solid subtype is also the most aggressive histology.&lt;/p&gt;

&lt;p&gt;This counterintuitive finding is consistent with what's sometimes called &lt;strong&gt;immune suppression rather than immune exclusion&lt;/strong&gt;: more immune cells arrive, but the suppressive environment is so potent that they can't function. High infiltration doesn't automatically mean effective anti-tumour immunity.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for Immunotherapy
&lt;/h2&gt;

&lt;p&gt;The data points to a few specific vulnerabilities worth thinking about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The large &lt;strong&gt;pre-dysfunctional T cell population&lt;/strong&gt; is a potential therapeutic opportunity - these cells aren't fully exhausted yet and might respond to checkpoint blockade&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;6:1 pro-tumour macrophage ratio&lt;/strong&gt; suggests that myeloid reprogramming strategies (not just T cell-targeting therapies) may be needed&lt;/li&gt;
&lt;li&gt;The Treg suppression appears to work partly through &lt;code&gt;CTLA4&lt;/code&gt;, which is consistent with the activity anti-CTLA4 therapy has shown in some NSCLC settings&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Subsetting for focused analysis pays off.&lt;/strong&gt; Rather than analysing all 38 cell states at once, isolating the T cell and macrophage compartments separately and running marker analysis on those subsets gave much cleaner, more interpretable results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Using the authors' cell type annotations directly was the right call.&lt;/strong&gt; For a dataset this complex, re-clustering from scratch would have introduced unnecessary uncertainty. Their labels are validated and biologically grounded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With 38 cell states and multiple findings, deciding what to prioritise in the write-up took as long as some analysis steps.&lt;/strong&gt; Storytelling is part of the work.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;Everything is in a fully annotated Jupyter Notebook. Download the H5AD file from CZ CELLxGENE and place it in the &lt;code&gt;data/&lt;/code&gt; folder to run it. &lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Farhan89082" rel="noopener noreferrer"&gt;
        Farhan89082
      &lt;/a&gt; / &lt;a href="https://github.com/Farhan89082/nsclc-tumour-microenvironment" rel="noopener noreferrer"&gt;
        nsclc-tumour-microenvironment
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Single-cell RNA-seq analysis of the NSCLC tumour microenvironment - T cell exhaustion trajectories and macrophage polarisation in lung adenocarcinoma
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;🫁 scRNA-seq Tumour Microenvironment Analysis: Mapping Immune Cell Infiltration in Non-Small Cell Lung Cancer&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/ea669f7071987d9f7060a32f808785b46a2545d6904316dfee5ae52b2b4d6d02/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31322d626c75653f6c6f676f3d707974686f6e"&gt;&lt;img src="https://camo.githubusercontent.com/ea669f7071987d9f7060a32f808785b46a2545d6904316dfee5ae52b2b4d6d02/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31322d626c75653f6c6f676f3d707974686f6e" alt="Python"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/0c1e5c1d8632d0911579f496350f8fe428414a7c5baae05a807a500a71cca61b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5363616e70792d312e31322e312d677265656e"&gt;&lt;img src="https://camo.githubusercontent.com/0c1e5c1d8632d0911579f496350f8fe428414a7c5baae05a807a500a71cca61b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5363616e70792d312e31322e312d677265656e" alt="Scanpy"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/fcc4486d576902aeff57adc40bb3746081ac36c47c2d8efa685bbb3d837c689a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f43656c6c732d3131312532433638332d6f72616e6765"&gt;&lt;img src="https://camo.githubusercontent.com/fcc4486d576902aeff57adc40bb3746081ac36c47c2d8efa685bbb3d837c689a/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f43656c6c732d3131312532433638332d6f72616e6765" alt="Cells"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/5a115de276fcd7c5e10b08980d02af2043a99c63e9a8b33f4dd6c4a79585c66c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f43656c6c25323054797065732d33382d707572706c65"&gt;&lt;img src="https://camo.githubusercontent.com/5a115de276fcd7c5e10b08980d02af2043a99c63e9a8b33f4dd6c4a79585c66c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f43656c6c25323054797065732d33382d707572706c65" alt="Cell 
Types"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/c4d4c5fb44c08b85ff48097669ae3661f4bac620d1d059881409e63ef6e5b84b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5374617475732d436f6d706c6574652d627269676874677265656e"&gt;&lt;img src="https://camo.githubusercontent.com/c4d4c5fb44c08b85ff48097669ae3661f4bac620d1d059881409e63ef6e5b84b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5374617475732d436f6d706c6574652d627269676874677265656e" alt="Status"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;📌 Background&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Non-small cell lung cancer (NSCLC) is the most common form of lung cancer, accounting for approximately 85% of all cases. Despite the success of immune checkpoint inhibitors, a significant proportion of patients do not respond to immunotherapy — a failure largely attributed to T cell exhaustion and immunosuppressive remodelling of the tumour microenvironment (TME).&lt;/p&gt;
&lt;p&gt;The TME is a complex ecosystem of tumour cells, immune cells, and stromal cells that collectively determine whether the immune system can mount an effective anti-tumour response. Understanding the cellular composition and functional states within the TME is critical for identifying new therapeutic targets and predicting immunotherapy response.&lt;/p&gt;
&lt;p&gt;This project performs a comprehensive single-cell RNA sequencing (scRNA-seq) analysis of the NSCLC tumour microenvironment, profiling &lt;strong&gt;111,683 cells&lt;/strong&gt; across &lt;strong&gt;38 distinct cell states&lt;/strong&gt; to characterise T cell exhaustion trajectories, macrophage polarisation states…&lt;/p&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Farhan89082/nsclc-tumour-microenvironment" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;The TME is one of the most complex biological systems we can now study at single-cell resolution - and it's increasingly clear that understanding it is key to making immunotherapy work for more patients. Happy to discuss the analysis, the biology, or the pipeline in the comments.&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>python</category>
      <category>datascience</category>
      <category>science</category>
    </item>
    <item>
      <title>How I Mapped Brain Cell Changes in Alzheimer's Disease Using Single-Cell RNA Sequencing</title>
      <dc:creator>Farhan Rehman Sherief</dc:creator>
      <pubDate>Sat, 06 Jun 2026 18:39:29 +0000</pubDate>
      <link>https://dev.to/farhansherief/how-i-mapped-brain-cell-changes-in-alzheimers-disease-using-single-cell-rna-sequencing-4lim</link>
      <guid>https://dev.to/farhansherief/how-i-mapped-brain-cell-changes-in-alzheimers-disease-using-single-cell-rna-sequencing-4lim</guid>
      <description>&lt;p&gt;Alzheimer's disease affects over 55 million people worldwide, yet the precise molecular changes happening inside individual brain cells remain poorly understood. I wanted to dig into that question - not at the tissue level, but at single-cell resolution.&lt;/p&gt;

&lt;p&gt;So I built a full scRNA-seq analysis pipeline in Python using Scanpy, working with a publicly available dataset of 63,608 nuclei from human prefrontal cortex tissue (sourced from CZ CELLxGENE). The donors spanned three Braak stages: 0 (cognitively normal), 2 (early Alzheimer's), and 6 (severe Alzheimer's).&lt;/p&gt;

&lt;p&gt;Here's what I found and how I found it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Dataset
&lt;/h2&gt;

&lt;p&gt;The data came from a study on the molecular characterisation of selectively vulnerable neurons in AD. It covers the superior frontal gyrus, a prefrontal region known to be hit hard by neurodegeneration, and includes seven major brain cell types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Glutamatergic neurons&lt;/li&gt;
&lt;li&gt;GABAergic neurons&lt;/li&gt;
&lt;li&gt;Oligodendrocytes&lt;/li&gt;
&lt;li&gt;OPCs (oligodendrocyte precursor cells)&lt;/li&gt;
&lt;li&gt;Astrocytes&lt;/li&gt;
&lt;li&gt;Microglia&lt;/li&gt;
&lt;li&gt;Endothelial cells&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;31,997 genes. 63,608 cells. Three disease stages. A lot to work with.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Pipeline
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Quality Control
&lt;/h3&gt;

&lt;p&gt;No dataset is clean out of the box. I filtered cells to keep only those with between 200 and 6,000 detected genes, and excluded anything with more than 20% mitochondrial gene content (high mitochondrial reads usually signal a dying or damaged cell). This removed around 2,809 low-quality cells.&lt;/p&gt;
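
&lt;p&gt;To make the filter concrete, here is a minimal, library-free sketch of the rule described above. The function name, the toy cells, and the choice to treat the 200 and 6,000 bounds as inclusive are assumptions for illustration, not the notebook's actual code.&lt;/p&gt;

```python
# Illustrative QC filter; thresholds are the ones quoted in the text.
MIN_GENES = 200
MAX_GENES = 6000
MAX_PCT_MITO = 20.0


def passes_qc(n_genes, total_counts, mito_counts):
    """Return True when a cell sits inside the gene window and at or below the mito cap."""
    if total_counts == 0:
        return False  # nothing was captured, so the cell cannot be assessed
    pct_mito = 100.0 * mito_counts / total_counts
    in_gene_window = MAX_GENES >= n_genes >= MIN_GENES
    return in_gene_window and MAX_PCT_MITO >= pct_mito


# Hypothetical per-cell summaries: (detected genes, total counts, mitochondrial counts).
cells = {
    "healthy_cell": (2500, 8000, 400),      # 5% mito, inside the window
    "low_complexity": (120, 300, 10),       # too few detected genes
    "likely_doublet": (7200, 40000, 800),   # too many detected genes
    "damaged_cell": (900, 2000, 700),       # 35% mito
}

kept = [name for name, stats in cells.items() if passes_qc(*stats)]
print(kept)
```

&lt;p&gt;Only the first toy cell survives. In a real pipeline the same three numbers per cell would come from the count matrix rather than being typed in by hand.&lt;/p&gt;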

&lt;h3&gt;
  
  
  2. Normalisation
&lt;/h3&gt;

&lt;p&gt;Library sizes were normalised to 10,000 counts per cell, followed by a log1p transformation. This is standard practice that makes cells comparable regardless of how deeply they were sequenced. I then identified 5,607 highly variable genes to focus the downstream analysis.&lt;/p&gt;
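
&lt;p&gt;The arithmetic behind this step is small enough to sketch without any libraries. The helper below and its two toy cells are hypothetical; it only illustrates why two cells with the same composition but different sequencing depth end up with the same values.&lt;/p&gt;

```python
import math

TARGET_SUM = 10_000  # counts per cell after scaling, as quoted in the text


def normalise_log1p(counts, target_sum=TARGET_SUM):
    """Scale one cell's counts so they sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


# Two hypothetical cells with identical composition but a 10x difference in depth.
shallow = [1, 3, 0, 6]     # 10 counts in total
deep = [10, 30, 0, 60]     # 100 counts in total

print(normalise_log1p(shallow))
print(normalise_log1p(deep))
```

&lt;p&gt;Both lines print the same list, which is the point: after scaling, depth no longer separates the two cells.&lt;/p&gt;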

&lt;h3&gt;
  
  
  3. Dimensionality Reduction
&lt;/h3&gt;

&lt;p&gt;PCA (50 components) → neighbourhood graph (10 neighbours, 20 PCs) → UMAP embedding.&lt;/p&gt;

&lt;p&gt;The UMAP is where the biology starts to become visible. All seven cell types formed distinct clusters, with clear separation between neuronal subtypes and glial populations.&lt;/p&gt;
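
&lt;p&gt;For readers who want to see how those three steps chain together, here is a rough sketch using Scanpy's standard &lt;code&gt;pp.pca&lt;/code&gt;, &lt;code&gt;pp.neighbors&lt;/code&gt; and &lt;code&gt;tl.umap&lt;/code&gt; functions with the parameters quoted above. The wrapper function and the parameter dictionary are my own naming for illustration, and the notebook may organise this differently.&lt;/p&gt;

```python
# Parameters quoted in the text above.
EMBED_PARAMS = {"n_comps": 50, "n_neighbors": 10, "n_pcs": 20}


def embed(adata, params=EMBED_PARAMS):
    """Run PCA, build the neighbour graph, and compute a UMAP on an AnnData object."""
    # Imported inside the function so the parameters can be inspected without Scanpy installed.
    import scanpy as sc

    sc.pp.pca(adata, n_comps=params["n_comps"])
    # The graph uses only the first n_pcs principal components.
    sc.pp.neighbors(adata, n_neighbors=params["n_neighbors"], n_pcs=params["n_pcs"])
    sc.tl.umap(adata)  # coordinates are stored in adata.obsm["X_umap"]
    return adata
```

&lt;p&gt;Note that 50 components are computed but only the first 20 feed the neighbour graph, so the remaining components are available for inspection without influencing the embedding.&lt;/p&gt;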

&lt;h3&gt;
  
  
  4. Differential Expression
&lt;/h3&gt;

&lt;p&gt;For the microglial analysis, I used a Wilcoxon rank-sum test comparing AD vs normal microglia, with Benjamini-Hochberg multiple testing correction to control the false discovery rate.&lt;/p&gt;
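
&lt;p&gt;The Benjamini-Hochberg step is easy to lose inside a library call, so here is a small standalone sketch of the adjustment itself. The p-values are made up for illustration and the function is not taken from the notebook.&lt;/p&gt;

```python
ALPHA = 0.05  # assumed significance cut-off for this illustration


def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.001, 0.008, 0.039, 0.041, 0.20]  # hypothetical per-gene p-values
adj = benjamini_hochberg(raw)
significant = [p for p in adj if ALPHA >= p]

print(adj)
print(len(significant), "of", len(raw), "genes pass after correction")
```

&lt;p&gt;Four of the five raw p-values sit below 0.05, but only two survive the correction. That gap is exactly what the false discovery rate control is for when thousands of genes are tested at once.&lt;/p&gt;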




&lt;h2&gt;
  
  
  The Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Glutamatergic Neurons Are Selectively Depleted
&lt;/h3&gt;

&lt;p&gt;One of the most striking results: glutamatergic (excitatory) neurons dropped from ~34% of cells in normal tissue to ~30% in AD tissue. This might sound like a small shift, but it amounts to a relative decrease of roughly 12%. At the scale of 60,000+ cells that is biologically meaningful, and it's consistent with what the literature already tells us about the selective vulnerability of excitatory neurons in AD.&lt;/p&gt;
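
&lt;p&gt;Here is a minimal sketch of how a composition shift like this can be computed from per-cell labels. The toy label lists are hypothetical and sized only to mirror the ~34% and ~30% figures above; they are not the real data.&lt;/p&gt;

```python
from collections import Counter


def cell_type_fractions(labels):
    """Return each label's share of all cells as a fraction between 0 and 1."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical toy labels mirroring the approximate percentages in the text.
normal = ["Glutamatergic"] * 34 + ["Other"] * 66
ad = ["Glutamatergic"] * 30 + ["Other"] * 70

frac_normal = cell_type_fractions(normal)["Glutamatergic"]
frac_ad = cell_type_fractions(ad)["Glutamatergic"]

absolute_drop = frac_normal - frac_ad         # difference in share of all cells
relative_drop = absolute_drop / frac_normal   # share of the normal-tissue level that was lost

print(f"{absolute_drop:.1%} absolute, {relative_drop:.1%} relative")
```

&lt;p&gt;Reporting both numbers helps: the absolute difference says how much of the tissue changed, while the relative one says how much of that particular population went missing.&lt;/p&gt;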

&lt;h3&gt;
  
  
  Alzheimer's Leaves a Clear Signature in Microglia
&lt;/h3&gt;

&lt;p&gt;Microglia are the brain's resident immune cells, and they showed the most dramatic transcriptomic shifts between AD and normal tissue. The differential expression analysis revealed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upregulated in AD microglia:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MALAT1&lt;/code&gt; - a long non-coding RNA strongly linked to neuroinflammation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FTH1&lt;/code&gt; - ferritin heavy chain, pointing to iron dysregulation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;B2M&lt;/code&gt; - beta-2 microglobulin, a known AD biomarker reflecting immune activation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FOXP1&lt;/code&gt; - a transcription factor tied to microglial activation states&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Downregulated in AD microglia:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;MT-CO3&lt;/code&gt;, &lt;code&gt;MT-CO1&lt;/code&gt;, &lt;code&gt;MT-ATP6&lt;/code&gt;, &lt;code&gt;MT-ND2&lt;/code&gt; - mitochondrial complex genes, suggesting impaired energy metabolism in AD-affected microglia&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This pattern is consistent with what the literature describes as disease-associated microglia (DAM), a distinct activation state that emerges in neurodegeneration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Disease Progression Captured Across Braak Stages
&lt;/h3&gt;

&lt;p&gt;Cells from all three Braak stages were distributed across every cluster in the UMAP. This reflects that AD-associated transcriptomic changes are not confined to one cell type; they propagate across the whole cellular ecosystem as the disease progresses.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Memory management matters.&lt;/strong&gt; 60K+ cells × 30K+ genes is a big matrix. Working with sparse AnnData objects and being deliberate about which steps you checkpoint to disk makes a real difference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cell type annotation is an art.&lt;/strong&gt; The dataset came with pre-annotated cell types, but validating them against canonical marker genes (the dotplot step) is essential, and satisfying when the biology confirms itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Volcano plots are still one of the most readable ways to communicate differential expression.&lt;/strong&gt; They give you significance and fold change in one glance.&lt;/li&gt;
&lt;/ul&gt;
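
&lt;p&gt;To put a number on the memory point, here is a back-of-the-envelope sketch comparing a dense matrix with a CSR sparse layout for this dataset's dimensions. The 5% density is an assumption picked purely for illustration (I have not measured it on this dataset), as are the 4-byte value and index sizes.&lt;/p&gt;

```python
N_CELLS = 63_608   # nuclei in the dataset
N_GENES = 31_997   # genes in the dataset
BYTES_FLOAT32 = 4
BYTES_INT32 = 4
ASSUMED_DENSITY = 0.05  # assumption: fraction of non-zero entries, for illustration only


def dense_bytes(n_rows, n_cols, value_bytes=BYTES_FLOAT32):
    """Memory for a dense matrix: one value per entry."""
    return n_rows * n_cols * value_bytes


def csr_bytes(n_rows, n_nonzero, value_bytes=BYTES_FLOAT32, index_bytes=BYTES_INT32):
    """Memory for CSR: a value and a column index per non-zero, plus n_rows + 1 row pointers."""
    return n_nonzero * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


n_nonzero = int(N_CELLS * N_GENES * ASSUMED_DENSITY)

dense_gb = dense_bytes(N_CELLS, N_GENES) / 1e9
sparse_gb = csr_bytes(N_CELLS, n_nonzero) / 1e9

print(f"dense: {dense_gb:.2f} GB, CSR at assumed density: {sparse_gb:.2f} GB")
```

&lt;p&gt;Under these assumptions the dense matrix needs roughly ten times the memory of the sparse one, and any step that silently densifies the matrix pays that cost in full.&lt;/p&gt;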




&lt;h2&gt;
  
  
  The Code
&lt;/h2&gt;

&lt;p&gt;Everything is in a fully annotated Jupyter Notebook. If you want to reproduce the analysis, download the H5AD file from CZ CELLxGENE and drop it in the &lt;code&gt;data/&lt;/code&gt; folder.&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/Farhan89082" rel="noopener noreferrer"&gt;
        Farhan89082
      &lt;/a&gt; / &lt;a href="https://github.com/Farhan89082/alzheimers-scrna-analysis" rel="noopener noreferrer"&gt;
        alzheimers-scrna-analysis
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Single-cell transcriptomic analysis of Alzheimer's disease using Scanpy - cell-type-specific gene expression in the human prefrontal cortex
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;🧠 Single-Cell Transcriptomic Analysis of Alzheimer's Disease&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Cell-Type-Specific Gene Expression Changes in the Human Superior Frontal Gyrus&lt;/h3&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/ea669f7071987d9f7060a32f808785b46a2545d6904316dfee5ae52b2b4d6d02/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31322d626c75653f6c6f676f3d707974686f6e"&gt;&lt;img src="https://camo.githubusercontent.com/ea669f7071987d9f7060a32f808785b46a2545d6904316dfee5ae52b2b4d6d02/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e2d332e31322d626c75653f6c6f676f3d707974686f6e" alt="Python"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/0c1e5c1d8632d0911579f496350f8fe428414a7c5baae05a807a500a71cca61b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5363616e70792d312e31322e312d677265656e"&gt;&lt;img src="https://camo.githubusercontent.com/0c1e5c1d8632d0911579f496350f8fe428414a7c5baae05a807a500a71cca61b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5363616e70792d312e31322e312d677265656e" alt="Scanpy"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/017bf63e76a7bb12b804496f8727da2e301f9b9b1c74f363761c693d7e826b6b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f43656c6c732d36332532433630382d6f72616e6765"&gt;&lt;img src="https://camo.githubusercontent.com/017bf63e76a7bb12b804496f8727da2e301f9b9b1c74f363761c693d7e826b6b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f43656c6c732d36332532433630382d6f72616e6765" alt="Cells"&gt;&lt;/a&gt; &lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/c4d4c5fb44c08b85ff48097669ae3661f4bac620d1d059881409e63ef6e5b84b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5374617475732d436f6d706c6574652d627269676874677265656e"&gt;&lt;img 
src="https://camo.githubusercontent.com/c4d4c5fb44c08b85ff48097669ae3661f4bac620d1d059881409e63ef6e5b84b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5374617475732d436f6d706c6574652d627269676874677265656e" alt="Status"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;📌 Background&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Alzheimer's disease (AD) is the most common form of dementia, affecting over 55 million people worldwide. While the hallmarks of AD — amyloid plaques and neurofibrillary tangles — are well established, the cell-type-specific molecular changes that drive neurodegeneration remain incompletely understood.&lt;/p&gt;
&lt;p&gt;Single-nucleus RNA sequencing (snRNA-seq) enables transcriptomic profiling of individual cells in post-mortem human brain tissue, making it a powerful tool for dissecting the cellular basis of AD. This project analyses a publicly available snRNA-seq dataset of the human superior frontal gyrus from AD and cognitively normal donors, sourced from the CZ CELLxGENE Discover platform. The dataset contains &lt;strong&gt;63,608 nuclei&lt;/strong&gt; across &lt;strong&gt;7 major brain cell types&lt;/strong&gt; and three Braak stages (0, 2, and 6), enabling analysis of both disease status and progression severity.&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🎯 Objectives&lt;/h2&gt;

&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;Perform quality control, normalisation, and dimensionality…&lt;/li&gt;
&lt;/ul&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/Farhan89082/alzheimers-scrna-analysis" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;





&lt;p&gt;If you're working with single-cell data or have questions about the pipeline, I'd love to hear from you in the comments. There's something fascinating about watching biology emerge from a matrix of gene counts.&lt;/p&gt;

</description>
      <category>bioinformatics</category>
      <category>python</category>
      <category>datascience</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
