DEV Community

Cover image for Spatial Transcriptomics Plus Topological Data Analysis: A Complete End-to-End Tutorial with squidpy and GUDHI
Oluwagbade Odimayo
Oluwagbade Odimayo

Posted on

Spatial Transcriptomics Plus Topological Data Analysis: A Complete End-to-End Tutorial with squidpy and GUDHI

Build a real spatial transcriptomics pipeline on Visium mouse brain, then go one step past every tutorial: use persistent homology to find spatial structure that Moran's I misses. Every number here is reproducible from a fresh clone.


Most spatial transcriptomics tutorials end at the same place. You load a Visium dataset, run Moran's I, colour a few genes on the tissue, and admire the picture. You never find out whether the standard spatial statistic actually captured everything that matters.

This tutorial goes further. Complete spatial pipeline was built with squidpy, then add a second lens that almost no tutorial covers: topological data analysis with GUDHI. By the end you will have a working benchmark that names a specific gene whose spatial structure Moran's I underestimates, and you will understand exactly what that claim does and does not mean.


What we are building

Three tasks, each answering a question a real spatial analysis should ask:

  1. Spatially variable gene detection. Does Moran's I recover the genes we know are spatially patterned?
  2. Neighbourhood enrichment. Do anatomically adjacent brain regions actually sit together in the data?
  3. TDA versus Moran's I. Does persistent homology see structure that autocorrelation misses?

The dataset is the canonical V1 Adult Mouse Brain Visium section: 2,688 spots, 15 annotated anatomical regions, about 18,000 genes.

A quick note on what each task is for. Tasks 1 and 2 are validation: they prove the pipeline reproduces known biology, so you can trust it. Task 3 is the contribution: it shows the standard method has a blind spot and that topology fills it.


Setup

python3 -m venv .venv && source .venv/bin/activate
pip install squidpy gudhi scanpy anndata scipy matplotlib pandas
Enter fullscreen mode Exit fullscreen mode

A note on versions: this tutorial targets squidpy 1.8.x, which renamed several parameters from older releases. If you copy from a 2022 era tutorial you will hit errors. I flag each one inline.


Step 1: Load the data

squidpy ships the dataset, so there is nothing to download manually:

import squidpy as sq

adata = sq.datasets.visium_hne_adata()
print(adata)
# AnnData object with n_obs x n_vars = 2688 x 18078
#   obs: 'cluster'        (15 anatomical region labels)
#   obsm: 'spatial'       (x, y coordinates on the tissue)
Enter fullscreen mode Exit fullscreen mode

Two things make this spatial data. obsm['spatial'] holds the physical coordinates of each spot on the tissue slide, and obs['cluster'] holds the manually annotated brain region for each spot. Those region labels are our ground truth in Tasks 1 and 2.


Step 2: Preprocess and build the spatial graph

Standard scanpy preprocessing, with one critical addition at the end, the spatial neighbourhood graph.

import scanpy as sc

# Filter, normalise, log transform
sc.pp.filter_genes(adata, min_cells=10)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.layers["log_norm"] = adata.X.copy()   # keep a clean log norm copy

# Standard embedding
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
sc.pp.pca(adata, n_comps=30)

# The spatial graph. This is what makes squidpy "spatial"
sq.gr.spatial_neighbors(adata, coord_type="grid", n_rings=1)
Enter fullscreen mode Exit fullscreen mode

squidpy 1.8.x gotcha number 1: older tutorials use coord_type="visium". That value no longer exists. For Visium's hexagonal array you now use coord_type="grid" with n_rings=1, which connects each spot to its roughly six immediate neighbours.

After this you have about 15,580 edges, roughly 5.8 neighbours per interior spot. That spatial graph is the foundation for both Moran's I and neighbourhood enrichment.

Keeping the log_norm layer matters later: the TDA step reads expression from it, so we want a clean copy taken before any scaling.


Step 3: Task 1, spatially variable genes with Moran's I

Moran's I asks one question per gene: do spatially adjacent spots have similar expression? It returns a value roughly between 0 and 1. High Moran's I means the gene's expression is spatially organised rather than scattered at random.

sq.gr.spatial_autocorr(
    adata,
    mode="moran",
    genes=adata.var_names.tolist(),
    n_perms=100,
    corr_method="fdr_bh",       # gotcha number 2: was "correction" pre 1.8
    show_progress_bar=False,
)

moran = adata.uns["moranI"].sort_values("I", ascending=False)
print(moran.head())
#          I        pval_norm   pval_norm_fdr_bh
# Mbp      0.788    ...
# Slc17a7  0.775    ...
# Nrgn     0.743    ...
# Cck      0.727    ...
# Itpka    0.698    ...
Enter fullscreen mode Exit fullscreen mode

squidpy 1.8.x gotcha number 2: the false discovery rate parameter is now corr_method, not correction. Results land in adata.uns["moranI"] with columns I, pval_norm, and pval_norm_fdr_bh.

The top gene is Mbp at I = 0.788, a myelin marker sharply concentrated in the white matter fibre tract. That is exactly what we expect. The sharpest spatial domain in the section ranks first.

Now validate it against known biology. A high Moran's I is only convincing if the genes it promotes are genuinely the ones biology says are spatially patterned. I use a curated panel of Allen Brain Atlas region markers as an oracle. Here is the complete panel, so you can reproduce the exact number:

oracle_markers = {
    "Layer 1":      ["Reln", "Ndnf", "Cxcl14"],
    "Layer 2/3":    ["Cux1", "Cux2", "Calb1"],
    "Layer 4":      ["Rorb", "Scnn1a", "Rspo1"],
    "Layer 5":      ["Bcl11b", "Fezf2", "Etv1"],
    "Layer 6":      ["Ntsr1", "Syt6", "Ctgf"],
    "White matter": ["Mbp", "Mog", "Plp1"],
    "Hippocampus":  ["Prox1", "Dkk3", "C1ql2"],
}

# Flatten, keep only markers that survived HVG filtering
all_markers = [g for grp in oracle_markers.values() for g in grp]
tested = [g for g in all_markers if g in adata.var_names]   # 17 of 21 survive
top_500 = set(moran.head(500).index)
found = [m for m in tested if m in top_500]

sensitivity = len(found) / len(tested)
print(f"tested={len(tested)}  found={len(found)}  sensitivity={sensitivity:.2f}")
# tested=17  found=7  sensitivity=0.41
Enter fullscreen mode Exit fullscreen mode

Sensitivity is 0.41: seven of the seventeen testable markers appear in the top 500 SVGs. The seven it finds are Mbp, Mog, Plp1 (white matter), Dkk3 (hippocampus), and the strong cortical markers Calb1, Fezf2, C1ql2.

Which markers does it miss, and why? This is the important part. Print their ranks:

missed = [m for m in tested if m not in found]
for m in missed:
    rank = moran.index.get_loc(m) + 1
    print(f"{m:10s} rank {rank:5d}/{len(moran)}   I={moran.loc[m,'I']:.3f}")
# Cux2       rank   948   I=0.191
# Reln       rank   989   I=0.181
# Rorb       rank   996   I=0.179
# ...
Enter fullscreen mode Exit fullscreen mode

The missed markers are smooth, low amplitude cortical layer gradients. Cux2 (rank 948), Reln (989), and Rorb (996) are real layer markers, but their expression changes gradually across the cortex rather than forming a sharp boundary.

This is not a bug. Moran's I ranks sharp concentrated domains above broad gradients, because a smooth gradient has lower local spot to spot contrast. Hold that thought. It is the entire reason for Step 5.


Step 4: Task 2, neighbourhood enrichment

Do brain regions that are anatomically adjacent actually co-localise in the data? squidpy runs a permutation test on the spatial graph:

sq.gr.nhood_enrichment(adata, cluster_key="cluster", show_progress_bar=False)
z = adata.uns["cluster_nhood_enrichment"]["zscore"]   # 15 x 15 z score matrix
Enter fullscreen mode Exit fullscreen mode

For every pair of regions, this returns a z score: how much more adjacent they are than you would expect by chance, given the size of each region. A z score above about 1 is genuine enrichment; a negative z score means the two regions are more separated than chance.

I validate five pairs that real neuroanatomy says should co-localise:

expected_pairs = [
    ("Hippocampus",       "Pyramidal_layer"),                 # z = 24.9
    ("Cortex_4",          "Cortex_5"),                         # z = 14.7
    ("Hippocampus",       "Pyramidal_layer_dentate_gyrus"),   # z = 13.3
    ("Lateral_ventricle", "Striatum"),                        # z =  5.1
    ("Fiber_tract",       "Lateral_ventricle"),               # z =  5.0
]
Enter fullscreen mode Exit fullscreen mode

All five enrich strongly. The hippocampal complex lights up at z = 24.9.

A trap worth knowing about. My first attempt validated pairs like Thalamus_1 plus Thalamus_2, assuming the shared name meant they sit together. They do not. Their z score is about minus 3.8. Those numbered subclusters are transcriptionally distinct territories in separate parts of the tissue. They are spatially segregated, not adjacent.

The lesson is concrete: spatial adjacency and statistical neighbourhood enrichment are different things. Trust the enrichment matrix, not the cluster names. I corrected the validated pairs to reflect true anatomy, and the matrix confirmed every one.


Step 5: Task 3, where TDA sees what Moran's I cannot

Here is the step no tutorial covers, and the reason this project exists.

Moran's I is a single number measuring local pairwise similarity. It is blind to the shape of an expression pattern. A gene whose expression forms a ring, or a reticular multi focal network, has rich spatial structure that pairwise correlation barely registers.

Persistent homology measures shape. Before the code, the one paragraph of theory you actually need:

Imagine each expressing spot as a point on the tissue. Now grow a disc of radius r around every point and let the discs merge as r increases. At small r you have many separate blobs; as r grows they connect, and sometimes a ring of points encloses an empty region before finally filling in. Persistent homology records, across every value of r at once, how many connected pieces exist (called H0) and how many enclosed loops exist (called H1), and crucially how long each one survives as r grows. A loop that persists over a wide range of r is a real, robust ring in the data. A loop that appears and vanishes immediately is noise. That construction, discs growing over a point cloud, is a Vietoris Rips filtration, and GUDHI computes it for us.

Build a point cloud per gene

For each gene, take the spots where it is expressed and treat their coordinates as a point cloud:

import numpy as np
import gudhi as gd

def gene_point_cloud(adata, gene, max_points=250):
    idx = adata.var_names.get_loc(gene)
    expr = adata.layers["log_norm"][:, idx]
    expr = np.asarray(expr.todense()).ravel() if hasattr(expr, "todense") else expr

    mask = expr > np.median(expr[expr > 0])    # spots expressing above median
    coords = adata.obsm["spatial"][mask].astype(float)

    # Subsample dense clouds. Critical for memory, see below
    if len(coords) > max_points:
        rng = np.random.default_rng(42)
        coords = coords[rng.choice(len(coords), max_points, replace=False)]

    # Normalise to the unit square so the filtration scale is comparable across genes
    coords -= coords.min(0)
    coords /= coords.max(0).clip(min=1e-9)
    return coords
Enter fullscreen mode Exit fullscreen mode

Memory gotcha number 3: a Vietoris Rips complex grows roughly cubically with point count. A gene expressed in 800 spots generates tens of millions of simplices and gets killed by the operating system's out of memory killer. Subsampling to 250 points keeps the topology intact while bounding memory. Because the final analysis uses ranks rather than raw values, it is robust to the subsample.

Compute persistence and summarise it

def h1_entropy(coords, max_edge=1.5, min_persistence=0.01):
    rips = gd.RipsComplex(points=coords, max_edge_length=max_edge)
    st = rips.create_simplex_tree(max_dimension=2)
    st.compute_persistence()

    # Keep H1 (loops) that survive longer than the noise floor
    h1 = [(b, d) for dim, (b, d) in st.persistence()
          if dim == 1 and (d - b) >= min_persistence]
    if not h1:
        return 0.0

    lifetimes = np.array([d - b for b, d in h1])
    p = lifetimes / lifetimes.sum()
    return float(-(p * np.log(p)).sum())     # persistence entropy
Enter fullscreen mode Exit fullscreen mode

Persistence entropy is one scalar summarising the whole H1 diagram: it is high when a gene has many loops of comparable importance and low when it has few loops or one dominant one. It is a reasonable single number for "topological complexity", though it is not the only choice, and I will name its limits below.

Threshold gotcha number 4, the one that nearly fooled me. My first run returned H1 entropy of exactly 0.0 for every gene, including the most structured. The output was clean and uniform, and it was tempting to accept.

It was completely wrong. GUDHI was finding more than twenty genuine loops in Mbp, but my min_persistence=0.1 filter discarded all of them. Spatial loops in unit normalised point clouds are short lived: they live in the 0.01 to 0.1 persistence band. Lowering the threshold to 0.01 recovered them.

The takeaway is an instinct worth more than the fix: distrust a result that looks too clean. Zero complexity on a visibly structured dataset is a red flag, not a pass.

Compare the two rankings

Rank a panel of genes both ways, by Moran's I and by H1 entropy, then look for disagreement:

import pandas as pd

panel = moran.head(18).index   # mix of high, mid, low Moran's I genes
rows = []
for g in panel:
    rows.append({
        "gene": g,
        "morans_i":   moran.loc[g, "I"],
        "h1_entropy": h1_entropy(gene_point_cloud(adata, g)),
    })

df = pd.DataFrame(rows).set_index("gene")
df["morans_rank"] = df["morans_i"].rank(ascending=False)
df["tda_rank"]    = df["h1_entropy"].rank(ascending=False)
df["divergence"]  = (df["morans_rank"] - df["tda_rank"]).abs()
print(df.sort_values("divergence", ascending=False).head())
Enter fullscreen mode Exit fullscreen mode

Your output will look like this:

          morans_i  h1_entropy  morans_rank  tda_rank  divergence
gene
Apbb2        0.087       3.340          8.0       1.0         7.0
Gpr17        0.087       3.154          6.0       2.0         4.0
Mbp          0.788       3.035          1.0       8.0         7.0
Cck          0.727       2.905          4.0       9.0         5.0
Enter fullscreen mode Exit fullscreen mode

The result, stated precisely

Apbb2 is the standout. Moran's I ranks it around tenth in the wider gene set; by H1 persistence entropy it ranks first, the highest topological complexity in the panel. Gpr17 shows the same direction of divergence. The mirror image holds for Mbp: Moran's rank 1, the sharpest domain in the whole section, but only TDA rank 8, because a single solid blob has strong autocorrelation and simple topology.

Here is the honest framing, because it is what a careful reviewer will probe. What I have measured is that Apbb2 has a high H1 persistence entropy signal, and that this signal disagrees with its Moran's I rank. That is a real, reproducible methodological result. What I have not done is confirm by independent histology that Apbb2 forms a literal anatomical ring. The topology is evidence of loop like structure in the expressing spots, not proof of a biological annulus. Stating that boundary precisely is part of the result, not a hedge.

One more honest caveat. The H1 entropy values across the panel cluster fairly tightly, mostly between about 2.9 and 3.3. The ranking is stable and reproducible, but the absolute separation between genes is modest. The correct reading is "the two methods order genes differently, with Apbb2 the clearest case", not "TDA found a dramatic outlier Moran's I completely missed". The honest version is the more defensible one.


What the figures show

The persistence diagram for Apbb2 makes the construction concrete. One H0 component spans the full filtration, the single connected tissue domain, while the rest die near the diagonal. Thirty two H1 loops sit in the 0.05 to 0.2 band, exactly the short lived features the threshold fix recovered. If you ever see an empty H1 diagram on real tissue, you have a threshold bug, not a structureless gene.

The Moran versus TDA scatter places each gene by its two ranks. Points on the diagonal are genes the two methods agree about. Apbb2 and Gpr17 sit well below the diagonal, low Moran's rank but high TDA rank. That visible distance from the diagonal is the finding.


The five gotchas, collected

If you build this yourself, these will save you hours:

  1. coord_type="grid", not "visium" (squidpy 1.8.x)
  2. corr_method=, not correction= (squidpy 1.8.x)
  3. Subsample dense point clouds before the Rips complex, or run out of memory
  4. Set min_persistence low, around 0.01, because real spatial loops are short lived
  5. Validate neighbourhood pairs against the enrichment matrix, not against shared cluster names

Why this matters beyond the tutorial

Anyone can run Moran's I. The skill that separates a spatial genomics engineer from a tutorial follower is knowing the method's blind spots and reaching for the right complementary tool, then being precise about what the complement does and does not prove.

Persistent homology is that complement for loop rich and multi focal expression patterns. It is not a replacement for Moran's I. It is a second lens that catches a different kind of structure, and now you have a working, validated, honestly caveated example that names a real gene.

Full code: github.com/gbadedata/squidpy-spatial-benchmark. Runtime is about 105 seconds end to end.

The full pipeline, 135 tests, continuous integration, structured logging, and a unified JSON benchmark report, is on GitHub:

github.com/gbadedata/squidpy-spatial-benchmark


Building spatial transcriptomics tools, topological data analysis, or single cell evaluation frameworks? Connect on GitHub or LinkedIn.

Top comments (0)