Oluwagbade Odimayo

Posted on Jun 21

Your Variant Caller Tells You How Confident It Is. Have You Ever Checked If It's Telling the Truth?

#machinelearning #bioinformatics #python #datascience

Every variant caller you have ever used attaches a number to each call: QUAL. It is a confidence score, a claim about how likely the call is to be correct. You filter on it constantly. QUAL >= 30, QUAL >= 20, whatever your pipeline settled on years ago.

Here is a question almost nobody asks: is that number honest?

When the caller stamps a variant at QUAL 30, it is claiming the call is 99.9% likely to be real. Among all the calls it stamps at that confidence, are 99.9% of them actually correct? Or is the caller systematically overstating how sure it is, and if so, where?

This is the question of calibration, and it is borrowed from machine-learning evaluation, where it is standard practice and where, for variant callers, it is almost completely absent. This post is about why it matters, how to measure it, and a small open-source framework that does.

What QUAL actually claims

QUAL is phred-scaled. The definition is precise:

QUAL = -10 * log10(P(call is wrong))

Invert it and you get the caller's stated probability that the call is correct:

def qual_to_confidence(qual: float) -> float:
    if qual <= 0:
        return 0.0
    p_wrong = 10.0 ** (-qual / 10.0)
    return 1.0 - p_wrong

So QUAL 10 claims 90% confidence, QUAL 20 claims 99%, QUAL 30 claims 99.9%. These are not vague quality hints. They are probability statements, and probability statements can be checked against reality.

Calibration: comparing the claim to the truth

To check them you need ground truth. For human variant calling that means the Genome in a Bottle (GIAB) benchmark set, the community-standard truth set for samples like HG001/NA12878. Run your caller, compare each call against GIAB: true positive if it is in the truth set, false positive if it is not.

Now group the calls by their stated confidence and, in each group, measure the empirical precision, the fraction that are actually true positives. A well-calibrated caller has, in every bin, empirical precision close to stated confidence. Plot one against the other and an honest caller's points sit on the diagonal.

The scalar summary of how far off you are is Expected Calibration Error (ECE): the support-weighted mean absolute gap between stated confidence and empirical precision across the bins.

ece = 0.0
for b in bins:
    gap = abs(b.mean_confidence - b.empirical_precision)
    ece += (b.n / total) * gap

ECE of 0 is perfect honesty. A large ECE with stated confidence consistently above empirical precision means the caller is overconfident: its QUAL says more than the calls deliver.

Why this is not academic

If QUAL is inflated, every threshold built on it is wrong, and wrong in a way that hides. Filter at "QUAL >= 30, surely safe" and, if the caller is overconfident, you are quietly keeping false positives you think you excluded. Worse, the inflation is rarely uniform. It concentrates in the hard parts of the genome, low-complexity regions, segmental duplications, low-mappability stretches, which is exactly where clinically important variants sometimes live and exactly where you can least afford a false sense of safety.

A plain precision/recall benchmark never reveals this. It tells you the caller's overall accuracy. It does not tell you whether you can trust the per-call confidence the caller hands you, which is the thing your filters actually consume.

A worked illustration

I built a small framework that computes exactly this: variant-calling-calibration-benchmark. It evaluates a caller on four layers: concordance (precision/recall/F1 vs GIAB), stratification (the same, split by genomic difficulty), calibration (the curve and ECE above), and a filtering-as-abstention analysis (more on that below).

To show the method end to end without a multi-gigabyte download, the repository ships a synthetic caller, built deliberately with the kind of miscalibration real callers exhibit: overconfidence concentrated in difficult regions and indels. I want to be plain that these demo numbers are synthetic; the point is to illustrate what the framework measures, not to report a discovery. Running it on a real caller is one command, shown at the end.

On that synthetic caller, the calibration layer reports:

Expected Calibration Error (ECE): 0.14
Mean stated confidence:           0.95
Empirical precision:              0.81
Verdict:                          OVERCONFIDENT

The caller asserts 95% mean confidence and delivers 81% precision. The calibration curve shows the points sitting below the diagonal across the whole confidence range. That gap is invisible to an F1 score and obvious the moment you plot stated confidence against empirical precision.

The stratification layer shows where the gap lives: near-perfect concordance (F1 ~0.99) in high-confidence regions, collapsing to ~0.77 in segmental duplications and low-mappability regions. The overconfidence is worst exactly where the genome is hardest.

Filtering as an abstention decision

Here is the part I find most useful, and it comes from thinking about callers the way you would think about any model that can decline to answer.

A QUAL filter is not just a quality cutoff. It is a deferral decision: every call below the threshold is one the caller is choosing not to commit to. Raise the threshold and you remove false positives (good) but also discard true positives (costly). The framework sweeps thresholds and finds two reference points:

the threshold that maximises retained F1 (best overall), and
the lowest threshold that reaches a target precision such as 99%, the point where what you keep is trustworthy enough to act on without review.

The distance between those two is a direct measure of the safe-versus-decisive tradeoff. On the synthetic caller, reaching 99% precision requires filtering all the way up to QUAL ~90, which discards most calls. The caller can be made trustworthy, but only by being made nearly silent. That is a property worth knowing before you deploy it, and a plain benchmark will never tell you.

Validating the evaluator itself

A benchmark whose metric is wrong is worse than no benchmark, so the ECE implementation is tested against inputs with known calibration. A synthetically honest caller (true-positive probability set exactly to the QUAL-implied confidence) must yield ECE near zero. A synthetically overconfident caller (high QUAL, 50% actually true) must yield large ECE. Both are asserted in the test suite. The metric is proven correct before it is trusted to judge anything.

def test_well_calibrated_low_ece():
    report = compute_calibration(well_calibrated_set())
    assert report.ece < 0.05

def test_overconfident_high_ece():
    report = compute_calibration(overconfident_set())
    assert report.ece > 0.2

Running it on real data, the right way

The framework's built-in matcher uses exact position-and-allele comparison, which is fine for normalised VCFs but stricter than it should be on real data, where variant representation differences are common. For real GIAB benchmarking the correct tool is hap.py, the GA4GH/GIAB field standard, which does proper normalisation and haplotype-aware matching.

So the real-data path delegates matching to hap.py and runs the calibration and abstention layers on its annotated output:

# hap.py does the matching (the hard, solved problem)
hap.py truth.vcf.gz your_caller.vcf.gz -f truth.bed -r GRCh38.fasta -o happy_out

# this framework adds the calibration and abstention analysis hap.py lacks
python3 -m src.run_benchmark --happy-vcf happy_out.vcf.gz --caller-name gatk-hc

That division of labour is the point: hap.py answers "how accurate, where," and this framework answers "is the confidence honest, and where should the caller stop trusting itself." The second question is the one almost no variant-calling benchmark asks.

The takeaway

Accuracy and calibration are different properties, and your filters depend on the second one. A caller can have a respectable F1 and still lie to you about its confidence in exactly the regions where being lied to costs the most. Measuring calibration is cheap, it borrows a standard idea from ML evaluation, and it tells you something a precision/recall table cannot.

I applied this same way of thinking, measure honesty of confidence, not just accuracy, to large language models interpreting clinical variants in a companion project. Different model, same question. It turns out to be the more interesting question in both cases.

Code, tests, and the full four-layer framework: github.com/gbadedata/variant-calling-calibration-benchmark

If you benchmark variant callers and have never plotted stated confidence against empirical precision, try it on your own GATK or DeepVariant output. I would be curious whether your caller is as honest as you assume.

References

Krusche P, et al. (2019). Best practices for benchmarking germline small-variant calls in human genomes. Nature Biotechnology 37(5):555-560. doi:10.1038/s41587-019-0054-x
Zook JM, et al. (2019). An open resource for accurately benchmarking small variant and reference calls. Nature Biotechnology 37(5):561-566. doi:10.1038/s41587-019-0074-6
Wagner J, et al. (2022). Benchmarking challenging small variants with linked and long reads. Cell Genomics 2(5):100128. doi:10.1016/j.xgen.2022.100128

DEV Community