Benchmarking the Honesty of Fine-Mapping Credible Sets
Fine-mapping has a promise built into its output, and almost nobody checks whether the promise is kept.
When you run SuSiE on a GWAS locus, it hands you a credible set: a small group of variants that, at a stated confidence level like 95%, should contain the true causal variant. That 95% is a claim about reality. Among all the loci where SuSiE reports a 95% credible set, the true causal variant should be inside the set 95% of the time.
Is it? This post is about how to measure that, what I found when I did, and why the answer is more interesting than a single coverage number.
Why you cannot measure this on real data
Here is the catch that shapes everything. To check whether a credible set contains the causal variant, you need to know the causal variant. On real GWAS data, you do not. That is the entire reason fine-mapping exists.
So calibration is measured by simulation, and this is not a shortcut, it is the only valid method. You plant a known causal variant in a simulated locus with realistic linkage disequilibrium, generate GWAS summary statistics consistent with that truth, run SuSiE, and check whether its credible sets behave as advertised. This is exactly how SuSiE, FiniMOM, SuSiEx, and the recent SuSiE 2.0 were all validated. Known ground truth is the whole point.
(To keep the simulation anchored to reality, the project also runs SuSiE on the SORT1 / 1p13 cholesterol locus, one of the rare real loci where the causal variant, rs12740374, is functionally validated. SuSiE recovers it. But the calibration numbers themselves come from simulation, as they must.)
The simulation
Each simulated locus needs three things SuSiE-RSS consumes: z-scores, an LD matrix, and a sample size.
simulate_locus <- function(p = 600, n = 20000, n_causal = 1,
pve = 0.002, block_w = 0.9, seed = 1) {
set.seed(seed)
R <- make_ld_matrix(p, block_w = block_w, neighbourhood = 6)
causal <- sort(sample.int(p, n_causal))
ncp <- sqrt(n * pve)
b <- numeric(p); b[causal] <- ncp / sqrt(n_causal)
mu <- as.vector(R %*% b) # LD spreads the signal to proxies
z <- as.vector(MASS::mvrnorm(1, mu = mu, Sigma = R))
list(z = z, R = R, causal = causal)
}
The causal variant gets a true effect; LD spreads marginal signal to its neighbours (which is what makes fine-mapping hard); and the z-scores are drawn consistent with that LD structure. The effect size pve is tuned to the realistic marginal regime, mean absolute z around 6 at the causal variant, just above genome-wide significance. Make the signal too strong and fine-mapping becomes trivial; this is where it is actually interesting.
The metric, and validating the metric
Coverage is simple to state: across many loci, the fraction whose reported credible set contains the true causal variant. For a calibrated method at the 95% level, that should be about 0.95.
But a benchmark whose own metric is wrong is worse than no benchmark. So before trusting the metric on SuSiE, I tested it against a mock fine-mapper with known coverage: if you feed it credible sets that contain the causal variant exactly 95% of the time, the metric must report 0.95; feed it 80%, it must report 0.80.
mock <- function(n, true_cov) as.logical(runif(n) < true_cov)
empirical_coverage(mock(5000, 0.95)) # ~0.95
empirical_coverage(mock(5000, 0.80)) # ~0.80, correctly flagged
Only once the metric provably recovers known coverage is it allowed to judge SuSiE. (This logic was prototyped and unit-tested in Python first, then ported to R, so the arithmetic was known-correct independently of any fine-mapping run.)
The bug that inflated every credible set
The first real run produced credible sets of 100-plus variants. That is nonsense, no useful fine-mapping returns a 100-variant "credible" set. The cause was a single missing argument.
SuSiE's credible-set construction includes a purity filter: it prunes sets down to variants that are genuinely correlated with one another, discarding uncorrelated noise. That filter only runs if you pass the LD matrix to susie_get_cs:
# wrong: no purity filter, sets fill with uncorrelated noise
cs <- susie_get_cs(fit, coverage = 0.95)
# right: Xcorr = R activates the purity filter
cs <- susie_get_cs(fit, Xcorr = R, coverage = 0.95)
With the fix, sets collapsed from 100-plus variants to typically one. I found this by running single loci through SuSiE and inspecting the actual set sizes, rather than trusting an assumption about what they should be. The lesson is one the benchmark itself preaches: distrust a number that looks wrong, and check it against ground truth.
The result
Across thousands of loci, six difficulty conditions:
| Condition | Coverage | Abstention |
|---|---|---|
| Baseline (1 causal) | 100.0% | 6% |
| Strong LD | 100.0% | 5% |
| Two causal | 99.8% | 41% |
| Three causal | 99.7% | 65% |
| Weak effect | 99.4% | 64% |
| Very weak effect | 98.7% | 85% |
Coverage holds at or above the promised 95% everywhere, slightly conservative, which is the safe direction. So far, unremarkable: SuSiE is well-calibrated, as you would hope.
The interesting column is the second one.
The finding is in the abstention
Look at what happens as the loci get harder. SuSiE does not start returning wrong credible sets. It starts returning no credible set at all. Abstention climbs from 6% on clean single-causal loci, to 41% with two causal variants, to 85% in the lowest-power regime.
This is the mechanism behind the calibration. SuSiE keeps its sets sharp, usually a single variant, and protects its 95% promise by declining the loci where it cannot meet that bar. The high coverage is not achieved by hedging with big sets. It is achieved by abstaining on the hard cases instead of guessing.
That distinction matters enormously for how you read the output. A naive reading of "SuSiE returned no credible set" is failure. The correct reading is honesty: the method is telling you it cannot confidently localize the causal variant here, which is exactly what you want it to say when that is true. A fine-mapper that always returned a confident set, including on the loci where it has no business being confident, would be far more dangerous.
Why I think this generalizes
I have now built three benchmarks that ask the same question of very different systems: is the model's stated confidence honest? An LLM interpreting clinical variants abstained to "uncertain" exactly where the evidence ran out. A variant caller's QUAL scores could be checked for whether the stated confidence matched empirical precision. And here, SuSiE abstains rather than mislead as the genetics gets harder.
Three systems, one property worth measuring: not just whether the model is accurate, but whether it knows and admits the limits of what it knows. In all three, the well-built method's answer to "are you sure?" turns out to be the most informative thing it produces.
Run it
The full framework is R, built on susieR, with the simulation harness, the calibration metrics, the validated mock test, the figures, and the SORT1 real-locus template:
github.com/gbadedata/finemap-calibration-benchmark
Rscript setup.R # susieR, MASS, jsonlite
Rscript tests/test_calibration.R # metric validation, no susieR needed
Rscript R/run_benchmark.R # the full benchmark
Rscript R/make_figures.R # the figures
If you fine-map, try measuring your own pipeline's credible-set coverage by simulation. The coverage number is reassuring. The abstention behaviour is where you learn what your method actually does when the data gets hard.
References
Verified against the primary source.
- Wang G, Sarkar A, Carbonetto P, Stephens M (2020). A simple new approach to variable selection in regression, with application to genetic fine mapping. JRSS-B 82(5):1273-1300. doi:10.1111/rssb.12388
- Zou Y, Carbonetto P, Wang G, Stephens M (2022). Fine-mapping from summary data with the "Sum of Single Effects" model. PLOS Genetics 18(7):e1010299. doi:10.1371/journal.pgen.1010299
- Musunuru K, et al. (2010). From noncoding variant to phenotype via SORT1 at the 1p13 cholesterol locus. Nature 466(7307):714-719. doi:10.1038/nature09266
Top comments (0)