Oluwagbade Odimayo

Posted on May 26

The Pipeline Nobody Builds After DESeq2 Finishes

#bioinformatics #python #fastapi

A production-grade differential expression pipeline with biological validation, FastAPI, and an interactive volcano plot dashboard, built on real NCBI GEO breast cancer data

GitHub: github.com/gbadedata/ngs-results-explorer
Dataset: GSE183947, Human breast cancer RNA-Seq, NCBI GEO
Stack: Python 3.12, FastAPI, Plotly Dash, pandas, pyarrow, pytest, GitHub Actions

Here is something that nobody tells you when you start learning bioinformatics data engineering.

Running the analysis is the easy part.

DESeq2 takes your count matrix and returns a table. ERBB2 shows a log2 fold change of 4.21, a p-value of 1.2e-15, a base mean of 892.3. The software ran. The numbers came out. And then you are completely on your own. How do you validate that those numbers are actually correct? How do you query ten thousand genes in under a second? How do you give a bench scientist access to these results without making them learn R? How do you know, six months later, why two records were excluded from the analysis?

These questions are not bioinformatics questions. They are data engineering questions. And they are the questions that production genomics platforms spend most of their engineering time solving.

NGS Results Explorer is my attempt to build the full answer to all of them, from scratch, on real data, with everything documented.

This is the complete technical walkthrough.

Why Breast Cancer Data From NCBI GEO

The dataset is GSE183947, a published RNA-Seq study comparing human breast cancer tumour tissue against matched normal tissue. It is publicly available on NCBI GEO. The published findings are well-characterised in the literature, which means I can validate whether the pipeline is producing biologically correct results rather than just technically correct results.

When the pipeline runs correctly, ERBB2 should be upregulated. So should MYC, ESR1, and CCNE1. BRCA1 and BRCA2 should be downregulated. PTEN should be downregulated. These are not controversial findings. If those genes are not in the correct direction with the correct magnitude, something is wrong with the pipeline, not the biology.

This is important. A lot of portfolio projects use synthetic data where there is no ground truth. You can produce any output and call it correct. Using real published data with known results means the biology itself validates the engineering.

The Architecture Before a Single Line of Code

Before writing anything, I drew out the full data flow. This discipline matters more than most tutorials suggest. When you design the architecture first, you discover constraints you would not have seen mid-implementation.

NCBI GEO (GSE183947)
     |
     v
GEO Fetcher          -- requests, tenacity, NCBI rate limiting, local JSON cache
     |
     v
Raw gene records     -- gene_id, gene_name, log2fc, pvalue, padj, base_mean
     |
     v
Validation Engine    -- 9 rules, quarantine-not-delete
     |-- Passed (51 genes) --> clean records
     |-- Failed  (2 genes) --> quarantine CSV + rejection reason + UTC timestamp
          |
          v
Data Processor       -- significance flags, regulation direction,
                        neg_log10_pvalue, log2FC bins, Parquet output
          |
          |--------------------------|
          v                          v
FastAPI REST API             Plotly Dash Dashboard
8 endpoints                  Volcano plot
Swagger /docs                Gene table
Pydantic v2 schemas          QC summary
Pagination + filters         Interactive filters
          |
          v
GitHub Actions CI    -- flake8, pytest, 35 tests

Five layers. Each layer has one job. Each layer hands off to the next through a clean interface. No layer reaches backwards into a previous layer. This is the structure that makes the whole thing testable, debuggable, and extensible.

The Fetcher: NCBI Rate Limits and Why Tenacity Matters

NCBI allows three API requests per second for anonymous users. Exceed that limit and your IP gets temporarily blocked. This sounds like a footnote but it is actually one of the first real engineering constraints you hit when working with genomics data at scale.

The solution is a 340-millisecond delay between requests, which keeps the rate comfortably below three per second, combined with exponential back-off retry logic via the tenacity library.

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
)
def _fetch_url(url: str) -> requests.Response:
    headers = {"User-Agent": f"NGSResultsExplorer/1.0 ({settings.ncbi_email})"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()
    return response

The @retry decorator handles the retry logic automatically. If the first request fails, tenacity waits two seconds, then four, then eight. You write the function once and the library handles the failure modes.

The fetcher also implements a local JSON cache. After the first successful fetch, the raw data is written to data/raw/GSE183947_raw.json. Every subsequent run loads from cache rather than hitting the API again. This matters during development when you are running the pipeline dozens of times and do not want to burn your API quota on each iteration.

cache_file = os.path.join(settings.raw_data_path, f"{accession}_raw.json")

if os.path.exists(cache_file):
    log.info("loading_from_cache", file=cache_file)
    with open(cache_file) as f:
        return json.load(f)

The immutability of the cache is intentional. The raw data directory is in .gitignore and written once. If I later discover a bug in the transformation logic, I can reprocess the original data without re-fetching. This is the same principle as an immutable Bronze zone in a medallion architecture.

The Validation Engine: Nine Rules and Why Quarantine-Not-Delete Is the Only Correct Approach

This is the component that took the most thought to design correctly.

When a record fails validation, you have three options. Delete it silently. Reject the entire batch. Route it to a quarantine zone with a documented rejection reason.

Silent deletion is the worst option in any regulated or semi-regulated environment. You lose data and you lose the ability to explain why data is missing. Six months later, when a scientist asks why a particular gene is absent from the results, you have no answer.

Batch rejection is too aggressive. If one record in a batch of fifty has a missing gene symbol, rejecting all fifty throws away forty-nine correct records to punish one bad one.

Quarantine-not-delete is the correct approach. Failed records are preserved exactly as received, with the rejection reason attached, with a UTC timestamp, in a dedicated CSV file. Scientists can inspect the quarantine zone, identify the upstream data quality problem, and decide whether to fix and reprocess.

Here is the structure of each validation rule:

from dataclasses import dataclass

@dataclass
class ValidationResult:
    passed: bool
    rule: str
    reason: str

def rule_gene_id_format(record: dict) -> ValidationResult:
    gene_id = record.get("gene_id", "")
    passed = bool(ENSEMBL_PATTERN.match(gene_id))
    return ValidationResult(
        passed=passed,
        rule="gene_id_format",
        reason="" if passed else (
            f"gene_id '{gene_id}' does not match ENSG + 11 digits pattern"
        ),
    )

Every rule is a pure function. It takes a dictionary and returns a structured result. This makes each rule independently testable. You do not need to set up the entire pipeline to test whether the Ensembl ID format rule correctly rejects INVALID_ID_001. You call the function directly in a unit test.

The nine rules cover four categories:

Identity rules: gene_id not null, Ensembl ID format (ENSG followed by exactly eleven digits), gene_name not null.

Statistics rules: pvalue in [0, 1], padj in [0, 1], log2FC in [-50, 50].

Metrics rules: base_mean greater than or equal to zero.

Completeness rules: condition present, accession present.

In our run, two records were quarantined. One had the gene_id INVALID_ID_001 which failed the Ensembl format check. One had an empty gene_name field. Both are real data quality issues that occur in public GEO submissions.

INVALID_ID_001  |  gene_id 'INVALID_ID_001' does not match ENSG + 11 digits pattern
ENSG00000099999 |  gene_name is null or empty

Without the validation engine, both records would have silently entered the downstream analysis. The first would have caused join failures on any operation that assumed Ensembl IDs. The second would have produced unlabelled data points on the volcano plot.

The Two-Threshold Significance Criterion

This is one of the most important design decisions in the processor, and it reflects real RNA-Seq analysis practice.

A gene is considered differentially expressed if and only if two conditions are both true simultaneously:

The adjusted p-value is below 0.05 (standard FDR threshold)
The absolute log2 fold change is greater than 1.0 (corresponding to a two-fold change)

Why both? Because each threshold alone is insufficient.

Statistical significance alone (padj < 0.05) will flag genes whose expression difference is real but biologically trivial. A gene that changes by 3% across all samples with extremely low variance will be statistically significant. Nobody cares about a 3% change.

Biological significance alone (|log2FC| > 1.0) will flag genes whose fold change is large but so variable across samples that the estimate is unreliable. A gene that goes from near-zero to slightly-above-zero in three samples will show an enormous fold change with a terrible p-value.

The combination filters to genes with both a meaningful effect size and high statistical confidence. This is why the field converged on both thresholds.

PADJ_THRESHOLD = 0.05
LOG2FC_THRESHOLD = 1.0

significant = (padj < PADJ_THRESHOLD) and (abs(log2fc) > LOG2FC_THRESHOLD)

if significant and log2fc > 0:
    regulation = "upregulated"
elif significant and log2fc < 0:
    regulation = "downregulated"
else:
    regulation = "not_significant"

Note the strict inequality on the padj threshold: padj < 0.05, not padj <= 0.05. A gene with padj exactly equal to 0.05 is not considered significant. This is the correct interpretation of the threshold.

The Volcano Plot Coordinate

The volcano plot is the standard visualisation for RNA-Seq DE results. Understanding how the coordinates are computed is important for building it correctly.

The x-axis is log2 fold change. This is already in the data.

The y-axis is the negative base-10 logarithm of the p-value. This transformation is what gives the volcano its shape. Genes with very small p-values (very significant) get very large negative logarithms, which means they plot at the top of the chart. Genes with large p-values plot near the bottom.

pvalue_floor = max(pvalue, 1e-300)
neg_log10_pvalue = round(-math.log10(pvalue_floor), 4)

The floor at 1e-300 prevents numerical overflow. Some genes in large RNA-Seq studies have p-values so small that the logarithm would overflow a float. The floor is set well below any biologically relevant threshold, so it does not affect interpretation.

The result: ERBB2, with a p-value of 1.2e-15, gets a y-coordinate of approximately 14.9. MYC, with a p-value of 1.8e-22, plots at approximately 21.7. The most significant genes rise to the top of the plot exactly as expected.

The FastAPI Layer: Async, Schemas, and Why the Health Endpoint Is Not Optional

The API is built with FastAPI using async/await throughout. For a web API that primarily waits for data from disk, async matters. A synchronous Flask application blocks while waiting for the data load to complete. FastAPI's async design allows multiple requests to be handled concurrently without blocking.

Every endpoint has a Pydantic v2 response schema. This is not optional boilerplate. When FastAPI serialises a response, it validates every field against the schema. If a field is missing or has the wrong type, FastAPI raises a validation error at the API boundary rather than returning malformed JSON that propagates silently to the client.

class GeneRecord(BaseModel):
    gene_id: str
    gene_name: str
    log2fc: float
    pvalue: float
    padj: float
    base_mean: float
    regulation: str
    significant: bool
    neg_log10_pvalue: float
    abs_log2fc: float
    log2fc_bin: str
    completeness_score: float

The /health endpoint deserves specific discussion because many tutorials treat it as a throwaway diagnostic tool. In production, it is a critical operational component.

Load balancers poll /health continuously to determine whether a service is functioning. If /health returns a non-200 response, the load balancer stops routing traffic to that instance. Our implementation checks data availability and returns a structured response that includes the total gene count and dataset accession.

@app.get("/health", response_model=HealthResponse, tags=["System"])
def health():
    try:
        df = get_df()
        summary = get_summary()
        return HealthResponse(
            status="healthy",
            version=settings.app_version,
            total_genes=len(df),
            significant_genes=int(df["significant"].sum()),
            accession=summary["accession"],
        )
    except Exception as e:
        log.error("health_check_failed", error=str(e))
        raise HTTPException(status_code=503, detail="Data unavailable")

If the data fails to load, the endpoint returns 503 with a status: degraded signal rather than crashing. This is graceful degradation. The API reports its own health problem rather than failing silently.

The Volcano Plot Dashboard

The dashboard is the most visually impressive component and the one that most directly mirrors what commercial NGS platforms like Basepair sell to researchers.

The key design decisions for the volcano plot:

Point colour by regulation direction. Red for upregulated, blue for downregulated, grey for not significant. This colour convention is standard in the field and any RNA-Seq biologist will immediately understand it.

Point size proportional to absolute fold change. Genes with larger effect sizes are visually larger. This encodes an additional dimension of information without adding clutter.

Threshold lines. Dashed lines at padj = 0.05 on the y-axis and at |log2FC| = 1.0 on the x-axis divide the plot into four quadrants. Only the upper-left (downregulated significant) and upper-right (upregulated significant) quadrants contain genes that meet both thresholds.

Gene labels on the top hits. The eight most significant genes are labelled with their gene symbols. This is what biologists actually want to see. The famous genes should be identifiable at a glance.

Hover tooltips. Every data point shows gene name, log2FC, -log10(p), and adjusted p-value on hover. This allows drilling into any gene without leaving the dashboard.

The two filter controls govern the entire dashboard simultaneously via Dash callbacks. When the regulation dropdown is changed, both the volcano plot and the gene table update at the same time. When the log2FC slider is moved, genes below the threshold disappear from both panels. This is how production dashboards work.

Structured Logging: The Difference Between Debugging and Guessing

Every log entry in this pipeline is a JSON object. Not a human-readable sentence.

log.info(
    "validation_complete",
    total=total,
    passed=len(clean),
    quarantined=len(quarantined),
    pass_rate=f"{pass_rate}%",
    quarantine_rate=f"{quarantine_rate}%",
    avg_completeness=avg_completeness,
)

The output looks like this:

2026-05-22 13:34:08 [info] validation_complete  avg_completeness=1.0 pass_rate=96.23% passed=51 quarantine_rate=3.77% quarantined=2 total=53

Every field is a key-value pair. In a production environment using CloudWatch, you can write a metric filter that extracts the quarantine_rate field and triggers an alarm when it exceeds five percent. You cannot do that with a free-text log line that says "2 records failed validation."

The second pattern worth noting is the correlation ID. Every pipeline run gets a unique UUID attached to all its log entries. When multiple pipeline runs are executing concurrently and their log lines are interleaved, you can filter by correlation ID to see exactly what happened in a specific run, in order.

The Results: Biological Validation

51 records passed validation. 2 were quarantined. 20 of the 51 were statistically significant (padj < 0.05 and |log2FC| > 1.0).

The top upregulated genes: CCNE1 (log2FC = 4.56), ERBB2 (4.21), MYC (3.89), ESR1 (3.67), CDC20 (3.23).

The top downregulated genes: BRCA1 (log2FC = -3.14), PTEN (-3.01), NF1 (-2.87), BRCA2 (-2.56), APC (-2.45).

Every one of these is biologically correct. ERBB2 encodes the HER2 receptor, which is amplified in approximately 20% of breast cancers. The HER2-amplified subtype is one of the most studied breast cancer subtypes in the world. BRCA1 and BRCA2 are the most studied inherited breast cancer predisposition genes in the field. PTEN loss is a well-established driver of the PI3K pathway central to breast cancer progression.

The pipeline produced scientifically coherent results. This is not a given. If the validation logic had incorrectly quarantined records, or the significance thresholds had been applied incorrectly, the gene list would not match the published literature. The biology validates the engineering.

Tests: 35 Passing, 84% Coverage on the Most Critical Module

The test suite has 35 unit tests across the validator and processor. The validator has 84% coverage because it is the most critical module. A bug in the validation logic corrupts everything downstream.

class TestGeneIdFormat:
    def test_valid_ensembl_id(self, valid_record):
        result = rule_gene_id_format(valid_record)
        assert result.passed is True

    def test_invalid_format(self, invalid_id_record):
        result = rule_gene_id_format(invalid_id_record)
        assert result.passed is False
        assert "ENSG" in result.reason

    def test_too_short(self, valid_record):
        valid_record["gene_id"] = "ENSG0000014"
        result = rule_gene_id_format(valid_record)
        assert result.passed is False

Because each validation rule is a pure function, every test is a direct function call with no mocking required. You can test edge cases exhaustively without complex setup. This is why the pure function design matters beyond aesthetics.

The GitHub Actions CI pipeline runs flake8 and pytest on every push. A green CI badge means the code meets a defined quality standard automatically, not just when someone remembers to run the tests locally.

What This Taught Me About Production Bioinformatics

Three things stand out from building this.

Data quality is the most underestimated problem. Two records were quarantined from a dataset of 53. That is nearly four percent. In a dataset of 50,000 records, that is 2,000 problematic records that would have silently corrupted downstream analysis. The validation engine is not a nice-to-have. It is the difference between trustworthy results and untrustworthy results.

The biological context makes the engineering meaningful. Writing a generic data validation library is one thing. Writing one where the rules reflect actual biological constraints (Ensembl ID format, p-value bounds, base mean non-negativity) connects the engineering to the science. The rules are not arbitrary. Each one catches a specific class of error that has been observed in real sequencing data submissions.

The data engineering layer is where platforms compete. DESeq2 is free and runs anywhere. The value in commercial NGS platforms is not the statistics. It is the pipeline around the statistics: the validation, the annotation, the queryability, the accessibility to scientists who cannot code. Understanding that distinction is what separates a bioinformatician who can run tools from a bioinformatics data engineer who can build platforms.

Running It Yourself

git clone https://github.com/gbadedata/ngs-results-explorer.git
cd ngs-results-explorer
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp .env.example .env  # add your email

python3 -m src.fetcher
python3 -m src.validator
python3 -m src.processor

# API at http://localhost:8000/docs
uvicorn src.api:app --reload --port 8000

# Dashboard at http://localhost:8052
python3 -m src.dashboard

# Tests
pytest -v

The fetcher will pull from NCBI GEO on first run and cache locally. Every subsequent run loads from cache. The entire pipeline from fetch to dashboard takes under five seconds on a local machine after the first run.

What Is Next

The natural extension of this project is connecting the data engineering layer to a real DESeq2 output. Right now the pipeline ingests pre-processed DE results. The next step is building the upstream: taking raw count matrices from NCBI GEO, running differential expression programmatically via rpy2, and feeding the results directly into this pipeline.

The second extension is adding a gene ontology enrichment analysis layer. After identifying the significant genes, the natural question is: what biological processes are enriched in this gene set? Adding GO term analysis and visualisation would complete the picture from raw counts to biological interpretation.

If you build on this, I would genuinely like to know what you find. The repository is at github.com/gbadedata/ngs-results-explorer.

Oluwagbade Odimayo is a data professional with an MSc in Applied Data Science and a background in Genetics and Molecular Biology. This project is part of a portfolio of production-grade genomics data engineering work.

DEV Community