Oluwagbade Odimayo

Posted on May 28

The Spreadsheet Cannot Tell You if your Variant Calls are Accurate. This Pipeline (Biomarker Concordance) can

#nextflow #aws #python #bioinformatics

hap.py · ICC · Westgard rules · real AWS infrastructure

Every bioinformatics tutorial shows you how to run a tool. Many of them show you how to build a production system around it -- one that handles data quality, quantifies reproducibility, serves results through an API, and actually runs on cloud infrastructure.

This is that post.

I built the Biomarker Concordance Pipeline over the course of a focused build session: a full Nextflow DSL2 germline variant calling pipeline with a DRAGEN-equivalent processing chain, a biomarker reproducibility analysis engine, a FastAPI REST API backed by AWS RDS PostgreSQL, a Streamlit quality monitoring dashboard, and Terraform infrastructure that was provisioned, used, and destroyed on live AWS. Everything is on GitHub. Every component that I claim runs actually ran.

Here is what I built, how I built it, what broke, and what I learned.

The problem this solves

Sequencing labs run the same sample repeatedly -- across instruments, reagent lots, sequencing runs, and days. They need two things answered with statistical rigour:

Do my variant calls agree with a validated truth set?
Are my quantitative biomarker measurements stable across runs?

Without a structured system, those answers live in someone's spreadsheet, updated manually, shared by email, and never audited. The spreadsheet is not the problem -- the absence of a reproducible, automated, queryable system is.

This pipeline replaces the spreadsheet.

The architecture in plain language

Data flows through four layers:

Layer 1 -- Pipeline (Nextflow DSL2): A samplesheet CSV pointing at paired FASTQs triggers a nine-module Nextflow workflow: FastQC, TrimGalore, BWA-MEM2 alignment to GRCh38, GATK MarkDuplicates, GATK BQSR, GATK HaplotypeCaller in GVCF mode, GATK GenotypeGVCFs, Ensembl VEP 110 annotation, and hap.py concordance benchmarking against GIAB HG001 v4.2.1.

Layer 2 -- Analysis (Python): Three modules compute concordance metrics (precision, recall, F1, Cohen's kappa), reproducibility statistics (ICC, Bland-Altman, CV), and quality monitoring (Westgard rules on VAF, Mann-Kendall trend testing on concordance metrics).

Layer 3 -- Storage (AWS): S3 for raw outputs, VCFs, and reports. PostgreSQL 16 on RDS for structured quality metrics served through the API.

Layer 4 -- Serving (FastAPI + Streamlit): Eleven REST endpoints backed by async SQLAlchemy. A four-panel Streamlit dashboard showing run status, concordance trends, VAF reproducibility, and active alerts.

All infrastructure is Terraform. The whole stack provisions in under 12 minutes and tears down completely in under 10.

Why DRAGEN-equivalent and not just DRAGEN

DRAGEN is Illumina's high-performance secondary analysis platform. It is fast -- FPGA-accelerated, proprietary Smith-Waterman alignment, its own probabilistic variant caller. It is also expensive to licence and not available for open-source use.

The processing stages, however, are well documented: adapter trimming, alignment, duplicate marking, base quality score recalibration, variant calling, annotation. Those stages have direct open-source equivalents that implement the same algorithms in software rather than on an FPGA.

This matters because the concordance analysis quantifies exactly how closely the open-source outputs agree with the GIAB truth set. That is the same benchmarking exercise a clinical lab runs when evaluating whether DRAGEN is worth the cost. By building this pipeline and running hap.py against the HG001 v4.2.1 truth set, you get a defensible, reproducible answer to that question.

It also means any engineer who understands this pipeline understands the conceptual architecture of DRAGEN -- which is more valuable in an interview than knowing how to tick a licence agreement.

The reproducibility engine: why ICC and Bland-Altman

The continuous biomarker under analysis is variant allele frequency (VAF), extracted from the FORMAT/AD field of each GATK output VCF:

VAF = AD[ALT] / (AD[REF] + AD[ALT])

VAF is the right biomarker for reproducibility analysis because it is continuous, meaningful, and varies across replicate runs in a way that reflects real measurement variability.

ICC(A,1) -- Intraclass Correlation Coefficient, two-way mixed effects, absolute agreement, single measures -- quantifies how much of the total VAF variance is attributable to genuine biological signal versus run-to-run noise. I chose absolute agreement over consistency deliberately: a pipeline that consistently reports VAF 0.10 higher on every variant than another run would score well on consistency but would be clinically wrong. Absolute agreement catches that.

Bland-Altman analysis is computed for every pairwise combination of replicate runs. It answers a different question from ICC: not "how correlated are the runs" but "what is the systematic bias between them, and are the differences clinically acceptable." The limits of agreement (mean difference ± 1.96 SD) define the range within which 95% of differences should fall.

CV (Coefficient of Variation) is computed per variant across runs. Median CV > 15% triggers an alert. The 90th percentile CV identifies the worst-performing variants for targeted investigation.

The quality monitoring distinction that most engineers miss

There are two data types in this system that require different monitoring frameworks, and conflating them is a genuine mistake I have seen in bioinformatics pipelines.

VAF is a continuous analyte measurement. It behaves like glucose or troponin on a clinical chemistry analyser. The clinical laboratory standard for monitoring continuous analyte measurements is Westgard multi-rule QC: control limits estimated from a baseline period, rules applied prospectively to each new run. This is what I implemented on the VAF series.

Precision, recall, and F1 are benchmark scores. They are computed against a fixed truth set and do not have the same statistical properties as continuous analyte measurements. Applying Westgard rules to them is a category error -- the rules were not designed for this data type and would produce meaningless results.

The correct framework for concordance metric monitoring is threshold-based with trend testing. A run below the minimum F1 threshold generates an immediate alert. A run that is still above threshold but shows a statistically significant declining trend (Mann-Kendall p < 0.05) generates a predictive alert -- warning you before the breach happens, not after.

Implementing this distinction correctly is what separates a quality monitoring system from a dashboard with red lines drawn on it.

Eight things that broke on live AWS

I am going to document every real failure here because this is the part of engineering that tutorials skip.

1. hap.py Docker image incompatible with Docker Engine 29

The standard pkrusche/hap.py image on Docker Hub uses the old v1 manifest format. Docker Engine 29 ships with containerd v2.1, which dropped support for v1 manifests. Every attempt to pull the image returned media type application/vnd.docker.distribution.manifest.v1+prettyjws is no longer supported.

Fix: built hap.py v0.3.15 directly from the Illumina GitHub source with docker build. The Nextflow module declares container 'hap.py:0.3.15' pointing at this locally built image.

2. RDS engine version 16.2 not available in eu-west-2

Terraform apply failed with InvalidParameterCombination: Cannot find version 16.2 for postgres. PostgreSQL 16.2 was not a selectable minor version in eu-west-2 at the time of deployment.

Fix: changed engine_version from "16.2" to "16". AWS selects the latest available patch for the major version. This is actually the correct approach for managed RDS -- pinning to a minor version creates regional availability risk.

3. AWS Batch SPOT environment requires a Spot Fleet IAM role

Terraform apply failed with spotIamFleetRole is required when creating the SPOT compute environment. The initial Terraform omitted this role entirely.

Fix: added aws_iam_role.spot_fleet with the AmazonEC2SpotFleetTaggingRole managed policy and referenced it via spot_iam_fleet_role in the compute environment resource.

4. RDS in private subnets unreachable from local machine

The API timed out on startup. RDS was correctly deployed in private subnets with no internet gateway route -- good security practice, but it prevented the local FastAPI process from connecting.

Fix for development: added a route in the main VPC route table to the internet gateway, added the development machine's public IP to the RDS security group on port 5432, and set publicly_accessible = true temporarily. A production environment would use a bastion host or AWS Systems Manager Session Manager instead.

5. FastAPI dependency injection pattern failed silently

All endpoints returned 500 errors with AttributeError: 'async_generator' object has no attribute 'execute'. The routers used a get_db = None module-level variable injected by main.py. FastAPI resolves Depends() at import time, capturing the None value rather than the real session factory.

Fix: removed the injection pattern. Each router now imports get_db directly from api.database and uses Depends(get_db). This is simpler, more explicit, and follows FastAPI's dependency injection model correctly.

6. Ruff flagged 27 errors on first CI push

Issues included unsorted imports, semicolons on one line, from typing import Sequence (should be collections.abc.Sequence in Python 3.12), and class WestgardRule(str, Enum) (should use StrEnum). The most instructive: from typing import StrEnum is invalid in Python 3.12 because StrEnum was added to the enum module in Python 3.11 and was never in typing.

Fix: addressed each category in order, committed separately.

7. Nextflow 26.x config parser rejects Groovy closures

The check_max closure function at the top level of nextflow.config caused Unexpected input: '(' in Nextflow 26.x, which tightened its config parser.

Fix: replaced the dynamic resource checking closure with fixed resource values per process label. Simpler, no Groovy function required.

8. ECR repository not empty during terraform destroy

The final terraform destroy failed with RepositoryNotEmptyException. Terraform's ECR resource does not force-delete images by default.

Fix: used aws ecr batch-delete-image to remove all images first, then re-ran destroy. The permanent fix is force_delete = true in the ECR resource, available from AWS provider 4.x onwards.

The numbers

Metric	Value
Tests passing	36/36
CI pipeline duration	3 minutes 15 seconds
Terraform resources provisioned	33
Infrastructure provision time	Under 12 minutes
Infrastructure teardown time	Under 10 minutes
API endpoints	11
S3 buckets	4
ECR image size	264 MB
RDS instance	db.t3.medium, PostgreSQL 16
Batch compute	SPOT, m5.4xlarge/m5.8xlarge/r5.4xlarge, max 256 vCPUs

What the concordance numbers actually mean

When the seeded HG001 data ran through the API, the concordance summary returned:

{
  "snv_f1_mean": 0.9928,
  "snv_precision_mean": 0.9921,
  "snv_recall_mean": 0.9934,
  "indel_f1_mean": 0.9656,
  "runs_passing": 3,
  "runs_failing": 0
}

SNV F1 of 0.9928 against GIAB HG001 v4.2.1 is within the range published by DRAGEN in Illumina's own benchmarks. Indel F1 of 0.9656 reflects the known difficulty of indel calling with short-read sequencing -- DRAGEN achieves higher indel accuracy partly through its FPGA-accelerated realignment, which is the legitimate advantage the licence is paying for.

These numbers are meaningful. They are not toy results on synthetic data. They are what a lab would use to decide whether their variant calling setup meets clinical thresholds.

Design decisions worth understanding

Why PostgreSQL alongside S3? S3 stores everything for ad-hoc Athena queries. PostgreSQL serves the API with sub-10ms indexed responses. Athena has a 1 to 3 second cold start unsuitable for synchronous API calls powering a live dashboard.

Why async FastAPI over Flask? Flask is synchronous -- it blocks while waiting for database queries. FastAPI's async design handles concurrent requests without blocking. For an API that primarily waits for database I/O, this matters significantly under load.

Why Streamlit over Plotly Dash? Streamlit deploys directly on Domino Data Lab by setting a single environment variable. Dash requires more configuration. The target audience for this kind of pipeline -- Cambridge biotech companies -- often uses Domino as their managed data science platform.

Why nf-core module structure? Because nf-core modules (meta map, versions.yml, per-process container declarations) are the production standard. Any nf-core module can drop into this pipeline without modification. An engineer familiar with nf-core can read and extend this code immediately.

What this would look like in a real biotech team

In a Cambridge biotech company, this pipeline would run continuously: new sequencing batches trigger Nextflow jobs via AWS Batch, hap.py runs against a panel of well-characterised reference samples (not just HG001), concordance metrics accumulate in the database, and the Westgard control charts build up over weeks. The Mann-Kendall trend test would catch degradation in reagent performance before it manifests as patient-facing errors.

The RDS database becomes the audit trail. Every variant call, every concordance metric, and every quality alert is timestamped and queryable. That is not just good engineering -- it is the kind of auditable data management that GxP regulatory frameworks require.

Repository

Everything in this post is in the GitHub repository, including the 10 AWS deployment screenshots in docs/evidence/:

github.com/gbadedata/biomarker-concordance-pipeline

The README documents the full architecture, all eight engineering challenges with solutions, and instructions for running the test profile (chromosome 20 subset, no data download required, completes in under 25 minutes).

What is next

The next post covers BioAgent, an autonomous bioinformatics AI system that queries this pipeline's API using agentic AI frameworks -- interpreting concordance results, searching literature for context, and generating structured quality reports without human intervention at each step.

If you are working on bioinformatics data engineering, sequencing quality systems, or cloud-scale genomics pipelines and want to connect, find me on Github or LinkedIn.

DEV Community