Oluwagbade Odimayo

From Variant CSV to Review-Ready Report: A Python Workflow With Docker and GitHub Actions

Variant prioritisation often starts with a table.

But a table alone does not answer the most important question:

Which variants deserve closer review, and why?

The ClinVar Variant Prioritisation Workflow was built to answer that question with transparent scoring, input validation, and review reporting, packaged with Docker and checked in CI.

Repository:

GitHub

Tech Stack

Python
pandas
Pydantic
matplotlib
pytest
Make
Docker
GitHub Actions
mamba

What the Workflow Does

The workflow takes a curated inherited-disease variant dataset and ranks variants using transparent evidence rules.

Each variant receives:

a priority score out of 100
a priority tier
a rank in the ranked output
a review recommendation

Dataset Fields

The curated dataset includes:

variant_id
gene
chromosome
position
reference
alternate
consequence
clinvar_significance
review_status
allele_frequency
inheritance
phenotype_match_score
computational_score
disease_area

Validation Layer

Before scoring, the workflow checks:

required columns
valid allele frequency values
valid phenotype match score range
valid computational score range
record schema consistency

Pydantic is used for schema validation.

This prevents the scoring logic from running on malformed records.
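
A minimal sketch of what such a validation step could look like with Pydantic. The model name `VariantRecord`, the helper `validate_rows`, and the assumption that both scores lie in [0, 1] are illustrative and not taken from the repository; only a subset of the dataset fields is shown, and the example row uses made-up values.

```python
from typing import Dict, List, Tuple

from pydantic import BaseModel, Field, ValidationError


class VariantRecord(BaseModel):
    """Hypothetical schema covering a subset of the dataset fields."""

    variant_id: str
    gene: str
    consequence: str
    clinvar_significance: str
    review_status: str
    # An allele frequency is a proportion, so it must lie in [0, 1].
    allele_frequency: float = Field(ge=0.0, le=1.0)
    # Assumed ranges: the article does not state the score bounds.
    phenotype_match_score: float = Field(ge=0.0, le=1.0)
    computational_score: float = Field(ge=0.0, le=1.0)


def validate_rows(rows: List[dict]) -> Tuple[List[VariantRecord], Dict[int, str]]:
    """Split rows into validated records and per-row error messages."""
    valid: List[VariantRecord] = []
    errors: Dict[int, str] = {}
    for index, row in enumerate(rows):
        try:
            valid.append(VariantRecord(**row))
        except ValidationError as exc:
            # Missing fields and out-of-range values both end up here.
            errors[index] = str(exc)
    return valid, errors


# Made-up rows for illustration only; they are not from the curated dataset.
good_row = {
    "variant_id": "EXAMPLE1",
    "gene": "GENE1",
    "consequence": "stop_gained",
    "clinvar_significance": "pathogenic",
    "review_status": "reviewed_by_expert_panel",
    "allele_frequency": 0.0001,
    "phenotype_match_score": 0.9,
    "computational_score": 0.8,
}
bad_row = dict(good_row, variant_id="EXAMPLE2", allele_frequency=1.5)

records, problems = validate_rows([good_row, bad_row])
print(f"{len(records)} valid record(s), invalid row indexes: {sorted(problems)}")
```

Collecting errors per row, rather than stopping at the first failure, lets a reviewer see every malformed record in one pass before any scoring runs.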

Scoring Framework

The score is out of 100:

ClinVar-style significance: 30
Review status: 15
Variant consequence: 20
Allele frequency rarity: 15
Phenotype match: 20

Priority tiers:

>= 80   high_priority
60-79   moderate_priority
40-59   low_priority
< 40    minimal_priority

This is not a clinical diagnostic score. It is a transparent prioritisation score for review.
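
The weights and tier thresholds above can be sketched in a few lines of Python. The function names `priority_score` and `priority_tier` are hypothetical, and because the article does not say how each component is derived from the raw fields, this sketch assumes each component has already been reduced to a fraction between 0 and 1.

```python
# Maximum points per evidence component, taken from the weights listed above.
WEIGHTS = {
    "clinvar_significance": 30,
    "review_status": 15,
    "consequence": 20,
    "allele_frequency_rarity": 15,
    "phenotype_match": 20,
}


def priority_score(component_fractions: dict) -> float:
    """Combine per-component fractions (each in [0, 1]) into a 0-100 score.

    Assumption: the caller has already mapped raw fields to fractions;
    that mapping is not described in the article.
    """
    total = 0.0
    for name, max_points in WEIGHTS.items():
        fraction = component_fractions[name]
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{name} fraction must be in [0, 1], got {fraction}")
        total += fraction * max_points
    return total


def priority_tier(score: float) -> str:
    """Map a score to the four tiers listed above."""
    if score >= 80:
        return "high_priority"
    if score >= 60:
        return "moderate_priority"
    if score >= 40:
        return "low_priority"
    return "minimal_priority"


# Illustrative input: full marks everywhere except phenotype match.
example = {name: 1.0 for name in WEIGHTS}
example["phenotype_match"] = 0.5
score = priority_score(example)
print(score, priority_tier(score))
```

Keeping the weights in one dictionary makes it easy to confirm that they sum to 100 and to see exactly how much each evidence type can contribute to a variant's score.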

Example Result

Top ranked variants from the current dataset:

Rank  Variant  Gene   Consequence              Score
1     VAR010   DMD    stop_gained              99
2     VAR001   BRCA1  stop_gained              98
3     VAR014   FBN1   splice_donor_variant     96
4     VAR019   MLH1   splice_acceptor_variant  95
5     VAR008   SCN1A  frameshift_variant       94
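
Producing a ranking like this from a scored table could be sketched with pandas as follows. The column names `priority_score` and `rank` are assumptions, as is the tie-break on `variant_id`; the five rows are the ones shown above, shuffled so the sort has something to do.

```python
import io

import pandas as pd

# The five rows from the table above, deliberately out of order.
scored_csv = io.StringIO(
    "variant_id,gene,consequence,priority_score\n"
    "VAR014,FBN1,splice_donor_variant,96\n"
    "VAR001,BRCA1,stop_gained,98\n"
    "VAR008,SCN1A,frameshift_variant,94\n"
    "VAR010,DMD,stop_gained,99\n"
    "VAR019,MLH1,splice_acceptor_variant,95\n"
)
scored = pd.read_csv(scored_csv)

# Sort by score (highest first); break ties by variant_id for stable output.
ranked = scored.sort_values(
    ["priority_score", "variant_id"], ascending=[False, True]
).reset_index(drop=True)

# Add a 1-based rank column at the front of the table.
ranked.insert(0, "rank", list(range(1, len(ranked) + 1)))

print(ranked.to_string(index=False))
```

An explicit tie-break keeps the ranked CSV identical between runs, which matters when CI compares or inspects generated outputs.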

Outputs

The pipeline generates:

results/tables/ranked_variants.csv
results/tables/top_prioritised_variants.csv
results/reports/top_variant_review_report.md
results/figures/priority_score_distribution.png
results/figures/priority_tier_counts.png
results/figures/top_gene_priority_scores.png

Makefile Commands

make test
make score
make report
make figures
make pipeline

The full pipeline loads data, validates records, scores variants, generates review outputs, and creates figures.

Docker Workflow

docker build -t clinvar-variant-prioritisation:latest .
docker run --rm clinvar-variant-prioritisation:latest make test
docker run --rm clinvar-variant-prioritisation:latest make pipeline

Docker exposed two real issues.

First, make was missing inside the image.

Second, the non-root container user could not overwrite files under /app/results.

Both were fixed in the Dockerfile.

CI Workflow

GitHub Actions runs:

pytest test suite
full pipeline
expected output file checks

The workflow was also updated to opt into the Node.js 24 runtime.
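
The expected output file checks could be sketched as a small Python helper that a CI step calls after the pipeline has run. The helper name `missing_outputs` and the rule that an empty file counts as missing are assumptions; the paths are the ones listed under Outputs. The demo at the end runs against a temporary directory rather than a real pipeline run.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List

# Output paths listed in the Outputs section of this article.
EXPECTED_OUTPUTS = [
    "results/tables/ranked_variants.csv",
    "results/tables/top_prioritised_variants.csv",
    "results/reports/top_variant_review_report.md",
    "results/figures/priority_score_distribution.png",
    "results/figures/priority_tier_counts.png",
    "results/figures/top_gene_priority_scores.png",
]


def missing_outputs(root: Path, expected: Iterable[str] = EXPECTED_OUTPUTS) -> List[str]:
    """Return the expected paths that are absent or empty under `root`."""
    missing = []
    for relative in expected:
        path = root / relative
        # Assumption: a zero-byte file is treated the same as a missing one.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing


# Demo: create only the first expected file and report what is still missing.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    first = root / EXPECTED_OUTPUTS[0]
    first.parent.mkdir(parents=True)
    first.write_text("variant_id,priority_score\n")
    still_missing = missing_outputs(root)

print(f"{len(still_missing)} expected output(s) missing")
```

In a CI job, the same helper could be pointed at the repository root and the step made to exit with a non-zero status whenever the returned list is not empty.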

Documentation

The repository includes:

README.md
docs/methods.md
docs/limitations.md
docs/evidence_map.md
docs/reviewer_guide.md
docs/evidence/

Main Takeaway

The project demonstrates how a small variant dataset can become a reproducible scientific workflow.

It includes:

validation
transparent scoring
ranked outputs
review reporting
visual analytics
Docker reproducibility
CI
evidence tracking

The result is not a clinical diagnostic system. It is a professional bioinformatics workflow showing how variant prioritisation logic can be made transparent, reproducible, and review-ready.
