Oluwagbade Odimayo

Posted on Jun 15

From Variant CSV to Review-Ready Report: A Python Workflow With Docker and GitHub Actions

#cicd #datascience #docker #python

Variant prioritisation often starts with a table.

But a table alone does not answer the most important question:

Which variants deserve closer review, and why?

The ClinVar Variant Prioritisation Workflow was built to answer that question. It validates the input, scores each variant with transparent rules, and generates review reports, with Docker and CI keeping every run reproducible.

Repository:

GitHub

Tech Stack

Python
pandas
Pydantic
matplotlib
pytest
Make
Docker
GitHub Actions
mamba

What the Workflow Does

The workflow takes a curated inherited-disease variant dataset and ranks variants using transparent evidence rules.

Each variant receives:

a priority score out of 100
a priority tier
a rank in the sorted output
a review recommendation

Dataset Fields

The curated dataset includes:

variant_id
gene
chromosome
position
reference
alternate
consequence
clinvar_significance
review_status
allele_frequency
inheritance
phenotype_match_score
computational_score
disease_area

Validation Layer

Before scoring, the workflow checks:

required columns
valid allele frequency values
valid phenotype match score range
valid computational score range
record schema consistency

Pydantic is used for schema validation.

This prevents the scoring logic from running on malformed records.
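
The checks listed above can be sketched in plain Python. This is a dependency-free stand-in, not the project's Pydantic model: the function names are hypothetical, and the [0, 1] bounds on the three numeric fields are an assumption, since the post does not state the exact ranges. In Pydantic the same bounds would typically be declared on the model with Field(ge=0, le=1).

```python
# Column names are taken from the "Dataset Fields" list above.
REQUIRED_COLUMNS = {
    "variant_id", "gene", "chromosome", "position", "reference", "alternate",
    "consequence", "clinvar_significance", "review_status", "allele_frequency",
    "inheritance", "phenotype_match_score", "computational_score", "disease_area",
}

# Assumption: all three numeric fields are expected to lie in [0, 1].
BOUNDED_FIELDS = ("allele_frequency", "phenotype_match_score", "computational_score")


def missing_columns(header):
    """Return the required column names absent from a CSV header, sorted."""
    return sorted(REQUIRED_COLUMNS - set(header))


def range_errors(row):
    """Return one message per bounded field that is missing, non-numeric, or out of range."""
    errors = []
    for field in BOUNDED_FIELDS:
        try:
            value = float(row[field])
        except (KeyError, TypeError, ValueError):
            errors.append(f"{field}: missing or not numeric")
            continue
        # NaN fails this comparison too, so it is reported as out of range.
        if not 0.0 <= value <= 1.0:
            errors.append(f"{field}: {value} outside [0, 1]")
    return errors
```

A record is passed to the scoring step only when range_errors returns an empty list, which is the behaviour the validation layer is meant to guarantee.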

Scoring Framework

The score is out of 100:

ClinVar-style significance: 30
Review status: 15
Variant consequence: 20
Allele frequency rarity: 15
Phenotype match: 20

Priority tiers:

>= 80   high_priority
60-79   moderate_priority
40-59   low_priority
< 40    minimal_priority

This is not a clinical diagnostic score. It is a transparent prioritisation score for review.
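
As a minimal sketch of how the weights and tier cut-offs fit together: the weights and thresholds below come from the lists above, while the function names, the component keys, and the example breakdown are hypothetical. The rules that award points within each component are not shown in this post, so the sketch only sums component points and maps the total to a tier.

```python
# Maximum points per evidence component, as listed above.
WEIGHTS = {
    "clinvar_significance": 30,
    "review_status": 15,
    "consequence": 20,
    "allele_frequency_rarity": 15,
    "phenotype_match": 20,
}
assert sum(WEIGHTS.values()) == 100


def total_score(components):
    """Sum component points, rejecting any value outside 0..its maximum weight."""
    total = 0
    for name, cap in WEIGHTS.items():
        points = components.get(name, 0)
        if not 0 <= points <= cap:
            raise ValueError(f"{name}: {points} is outside 0..{cap}")
        total += points
    return total


def priority_tier(score):
    """Map a 0-100 score to the tier names used in this workflow."""
    if score >= 80:
        return "high_priority"
    if score >= 60:
        return "moderate_priority"
    if score >= 40:
        return "low_priority"
    return "minimal_priority"


# Hypothetical component breakdown, for illustration only.
example = {
    "clinvar_significance": 30,
    "review_status": 10,
    "consequence": 20,
    "allele_frequency_rarity": 15,
    "phenotype_match": 12,
}
print(total_score(example), priority_tier(total_score(example)))
```

Keeping the weights in one dictionary makes the 100-point budget easy to audit, and a reviewer can see at a glance which evidence type dominates the score.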

Example Result

Top ranked variants from the current dataset:

Rank	Variant	Gene	Consequence	Score
1	VAR010	DMD	stop_gained	99
2	VAR001	BRCA1	stop_gained	98
3	VAR014	FBN1	splice_donor_variant	96
4	VAR019	MLH1	splice_acceptor_variant	95
5	VAR008	SCN1A	frameshift_variant	94
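
Ranking itself is a descending sort on the score. The sketch below uses the five scores from the table above. The function name and the tie-break on variant ID are assumptions, since the post does not say how ties are resolved.

```python
# Scores copied from the table above, deliberately listed out of order.
scores = {"VAR001": 98, "VAR008": 94, "VAR010": 99, "VAR014": 96, "VAR019": 95}


def rank_variants(scores):
    """Return (rank, variant_id, score) tuples, highest score first.

    Ties are broken alphabetically by variant ID (an assumption).
    """
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, vid, score) for rank, (vid, score) in enumerate(ordered, start=1)]


for rank, vid, score in rank_variants(scores):
    print(rank, vid, score)
```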

Outputs

The pipeline generates:

results/tables/ranked_variants.csv
results/tables/top_prioritised_variants.csv
results/reports/top_variant_review_report.md
results/figures/priority_score_distribution.png
results/figures/priority_tier_counts.png
results/figures/top_gene_priority_scores.png

Makefile Commands

make test
make score
make report
make figures
make pipeline

The full pipeline loads data, validates records, scores variants, generates review outputs, and creates figures.

Docker Workflow

docker build -t clinvar-variant-prioritisation:latest .
docker run --rm clinvar-variant-prioritisation:latest make test
docker run --rm clinvar-variant-prioritisation:latest make pipeline

Running the workflow in a container exposed two real issues.

First, make was missing inside the image.

Second, the non-root container user could not overwrite files under /app/results.

Both were fixed in the Dockerfile.

CI Workflow

GitHub Actions runs:

pytest test suite
full pipeline
expected output file checks

The workflow was also updated to opt its JavaScript-based actions into the Node.js 24 runtime.
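
The expected output file check can be very small. The following is a sketch rather than the repository's actual CI step: the file paths are the ones listed under Outputs, while the function names and the exit-code convention are assumptions.

```python
from pathlib import Path

# Paths taken from the "Outputs" section of this post.
EXPECTED_OUTPUTS = [
    "results/tables/ranked_variants.csv",
    "results/tables/top_prioritised_variants.csv",
    "results/reports/top_variant_review_report.md",
    "results/figures/priority_score_distribution.png",
    "results/figures/priority_tier_counts.png",
    "results/figures/top_gene_priority_scores.png",
]


def missing_outputs(root="."):
    """Return the expected output paths that do not exist as files under root."""
    base = Path(root)
    return [path for path in EXPECTED_OUTPUTS if not (base / path).is_file()]


def check_outputs(root="."):
    """Print any missing outputs and return a process exit code (0 means all present)."""
    missing = missing_outputs(root)
    for path in missing:
        print(f"missing: {path}")
    return 1 if missing else 0
```

A CI step could call check_outputs() after make pipeline and pass the return value to sys.exit, so a missing report or figure fails the build instead of going unnoticed.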

Documentation

The repository includes:

README.md
docs/methods.md
docs/limitations.md
docs/evidence_map.md
docs/reviewer_guide.md
docs/evidence/

Main Takeaway

The project demonstrates how a small variant dataset can become a reproducible scientific workflow.

It includes:

validation
transparent scoring
ranked outputs
review reporting
visual analytics
Docker reproducibility
CI
evidence tracking

The result is not a clinical diagnostic system. It is a professional bioinformatics workflow showing how variant prioritisation logic can be made transparent, reproducible, and review-ready.
