Variant prioritisation often starts with a table.
But a table alone does not answer the most important question:
Which variants deserve closer review, and why?
The ClinVar Variant Prioritisation Workflow was built to answer that question with transparent scoring, input validation, review reporting, a Docker image, and CI.
Repository:
Tech Stack
Python
pandas
Pydantic
matplotlib
pytest
Make
Docker
GitHub Actions
mamba
What the Workflow Does
The workflow takes a curated inherited-disease variant dataset and ranks variants using transparent evidence rules.
Each variant receives:
priority score out of 100
priority tier
ranked output
review recommendation
Dataset Fields
The curated dataset includes:
variant_id
gene
chromosome
position
reference
alternate
consequence
clinvar_significance
review_status
allele_frequency
inheritance
phenotype_match_score
computational_score
disease_area
Validation Layer
Before scoring, the workflow checks:
required columns
valid allele frequency values
valid phenotype match score range
valid computational score range
record schema consistency
Pydantic is used for schema validation.
This prevents the scoring logic from running on malformed records.
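The checks listed above can be sketched without any dependencies. This is a minimal illustration, not the repository's Pydantic model: the function name `validate_record` is hypothetical, and the `[0, 1]` bounds on the three numeric fields are an assumption, since the post does not state the valid ranges.

```python
from typing import Dict, List

# Column names taken from the "Dataset Fields" list.
REQUIRED_COLUMNS = {
    "variant_id", "gene", "chromosome", "position", "reference", "alternate",
    "consequence", "clinvar_significance", "review_status", "allele_frequency",
    "inheritance", "phenotype_match_score", "computational_score", "disease_area",
}

# Assumption: each of these fields is a fraction in [0, 1].
BOUNDED_FIELDS = ("allele_frequency", "phenotype_match_score", "computational_score")


def validate_record(record: Dict[str, object]) -> List[str]:
    """Return the problems found in one record; an empty list means it passed."""
    errors = []

    # Required-column check.
    missing = sorted(REQUIRED_COLUMNS - set(record))
    if missing:
        errors.append("missing columns: " + ", ".join(missing))

    # Numeric range checks on the fields that are present.
    for field in BOUNDED_FIELDS:
        if field not in record:
            continue
        try:
            value = float(record[field])
        except (TypeError, ValueError):
            errors.append(f"{field} is not numeric: {record[field]!r}")
            continue
        if not 0.0 <= value <= 1.0:  # NaN also fails this comparison
            errors.append(f"{field} out of range [0, 1]: {value}")

    return errors
```

Returning a list of messages rather than raising on the first failure makes it possible to report every problem in a record at once, which is roughly what a Pydantic `ValidationError` also provides.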
Scoring Framework
The score is out of 100, split across five weighted categories:
ClinVar-style significance: 30
Review status: 15
Variant consequence: 20
Allele frequency rarity: 15
Phenotype match: 20
Priority tiers:
>= 80 high_priority
60-79 moderate_priority
40-59 low_priority
< 40 minimal_priority
This is not a clinical diagnostic score. It is a transparent prioritisation score for review.
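As a rough sketch, the weights and tier cut-offs above translate into a few lines of Python. The weights and tier names come from this post; the function names and the idea of passing each category as a strength between 0 and 1 are assumptions made for illustration, because the per-category rules (for example, which consequence earns how many points) are not spelled out here.

```python
# Maximum points per category, as listed above (they sum to 100).
WEIGHTS = {
    "clinvar_significance": 30,
    "review_status": 15,
    "consequence": 20,
    "allele_frequency_rarity": 15,
    "phenotype_match": 20,
}


def priority_score(strengths):
    """Combine per-category strengths (each in [0, 1]) into a 0-100 score.

    Assumption: how each strength is derived from the raw fields is left
    to the caller; a missing category contributes nothing.
    """
    total = 0.0
    for name, weight in WEIGHTS.items():
        strength = strengths.get(name, 0.0)
        if not 0.0 <= strength <= 1.0:
            raise ValueError(f"{name} strength must be in [0, 1], got {strength}")
        total += weight * strength
    return round(total)


def priority_tier(score):
    """Map a score to the tier names used in this post."""
    if score >= 80:
        return "high_priority"
    if score >= 60:
        return "moderate_priority"
    if score >= 40:
        return "low_priority"
    return "minimal_priority"
```

Keeping the weights in a single dictionary means the whole scoring policy can be read, reviewed, and changed in one place.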
Example Result
Top ranked variants from the current dataset:
| Rank | Variant | Gene | Consequence | Score |
|---|---|---|---|---|
| 1 | VAR010 | DMD | stop_gained | 99 |
| 2 | VAR001 | BRCA1 | stop_gained | 98 |
| 3 | VAR014 | FBN1 | splice_donor_variant | 96 |
| 4 | VAR019 | MLH1 | splice_acceptor_variant | 95 |
| 5 | VAR008 | SCN1A | frameshift_variant | 94 |
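The ranking step itself is a sort on score. A minimal sketch, using the five scores from the table above; the function name `rank_variants` and the tie-break on variant ID are assumptions, not necessarily what the repository does:

```python
# Scores copied from the table above, deliberately out of order.
scores = {
    "VAR001": 98,
    "VAR008": 94,
    "VAR010": 99,
    "VAR014": 96,
    "VAR019": 95,
}


def rank_variants(score_by_variant):
    """Return (rank, variant_id, score) tuples, highest score first.

    Assumption: ties are broken alphabetically by variant ID so the
    output is deterministic.
    """
    ordered = sorted(score_by_variant.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, vid, score) for rank, (vid, score) in enumerate(ordered, start=1)]


for rank, variant_id, score in rank_variants(scores):
    print(rank, variant_id, score)
```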
Outputs
The pipeline generates:
results/tables/ranked_variants.csv
results/tables/top_prioritised_variants.csv
results/reports/top_variant_review_report.md
results/figures/priority_score_distribution.png
results/figures/priority_tier_counts.png
results/figures/top_gene_priority_scores.png
Makefile Commands
make test
make score
make report
make figures
make pipeline
The full pipeline loads data, validates records, scores variants, generates review outputs, and creates figures.
Docker Workflow
docker build -t clinvar-variant-prioritisation:latest .
docker run --rm clinvar-variant-prioritisation:latest make test
docker run --rm clinvar-variant-prioritisation:latest make pipeline
Running the tests and pipeline inside the container exposed two real issues.
First, make was missing inside the image.
Second, the non-root container user could not overwrite files under /app/results.
Both were fixed in the Dockerfile.
CI Workflow
GitHub Actions runs:
pytest test suite
full pipeline
expected output file checks
The workflow was also updated to opt into the Node.js 24 runtime for JavaScript-based actions.
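The expected output file check can be as simple as the sketch below. The paths are the ones listed under Outputs; the function name `missing_outputs` and the rule that an empty file counts as missing are assumptions for illustration, not a copy of the repository's CI step.

```python
from pathlib import Path
from typing import List

# Output paths as listed in the "Outputs" section.
EXPECTED_OUTPUTS = [
    "results/tables/ranked_variants.csv",
    "results/tables/top_prioritised_variants.csv",
    "results/reports/top_variant_review_report.md",
    "results/figures/priority_score_distribution.png",
    "results/figures/priority_tier_counts.png",
    "results/figures/top_gene_priority_scores.png",
]


def missing_outputs(root: str = ".") -> List[str]:
    """Return the expected outputs that are absent or empty under root."""
    base = Path(root)
    missing = []
    for relative in EXPECTED_OUTPUTS:
        path = base / relative
        # Assumption: a zero-byte file is treated as a failed output.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing
```

In a CI job, a small wrapper would call `missing_outputs()` after `make pipeline` and exit with a non-zero status if the returned list is not empty, so a silently skipped figure or report fails the build.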
Documentation
The repository includes:
README.md
docs/methods.md
docs/limitations.md
docs/evidence_map.md
docs/reviewer_guide.md
docs/evidence/
Main Takeaway
The project demonstrates how a small variant dataset can become a reproducible scientific workflow.
It includes:
validation
transparent scoring
ranked outputs
review reporting
visual analytics
Docker reproducibility
CI
evidence tracking
The result is not a clinical diagnostic system. It is a professional bioinformatics workflow showing how variant prioritisation logic can be made transparent, reproducible, and review-ready.