I built a benchmark to find out whether a frontier language model can be trusted to interpret clinical genetic variants. The result surprised me, and the way it surprised me is the whole point of the post.
The model I tested (Claude Opus 4.8) scored 60 percent accuracy against expert consensus. If I had stopped there, I would have written "the model is mediocre, do not deploy." That conclusion would have been wrong. The real finding only appeared once I stopped measuring accuracy and started measuring something else.
Here is what I learned about building benchmarks for high-stakes domains, with the code and the numbers.
The setup: why variant interpretation is a hard thing to benchmark
When a lab sequences your DNA and finds a change in a gene, the critical question is whether that change matters. Is it pathogenic (disease-causing) or benign (harmless)? This is variant interpretation, and it is hard precisely because you usually cannot verify an interpretation without expert consensus. There is no unit test for "is this BRCA2 missense variant pathogenic."
That property is exactly what makes it interesting as an evaluation problem. If you want to benchmark an LLM on variant interpretation, your first and hardest job is finding trustworthy ground truth.
It exists. ClinVar, the NIH's public variant database, assigns every variant a review status on a four-star scale:
| Stars | Meaning |
|---|---|
| 4 | Backed by practice guideline |
| 3 | Reviewed by expert panel |
| 2 | Multiple submitters, no conflicts |
| 1 | Single submitter |
| 0 | No assertion criteria |
That star rating is an expert-consensus signal baked right into the data. I restricted the oracle to 2-star-and-above, so every "ground truth" label reflects multiple independent laboratories agreeing. When the model disagrees, it is disagreeing with a consensus, not one opinion.
The task
For each variant, the model gets a structured packet: gene, HGVS coding and protein notation, molecular consequence, associated condition. It returns strict JSON: a three-class call (pathogenic, benign, uncertain), the gene, the consequence, the mechanism, its reasoning, and the evidence it used. The oracle is hidden until after the model commits.
class InterpretationResult(BaseModel):
variant_id: str
classification: Classification # PATHOGENIC | BENIGN | VUS
stated_gene: str = ""
stated_consequence: str = ""
stated_mechanism: str = ""
reasoning: str = ""
cited_evidence: list[str] = []
The whole framework is model-agnostic behind a VariantInterpreter protocol, with a deterministic MockInterpreter for CI (no API key) and a ClaudeInterpreter for live runs. That separation matters: it let me unit-test the entire scoring engine offline with a controlled-accuracy mock, so a mock that copies the oracle exactly is asserted to score 1.000, and every metric is verified against hand-computed values. You should be able to prove your evaluator is correct before you ever spend a cent on API calls.
The first run looked like a disaster
I ran 30 real variants. The numbers came back ugly:
Overall accuracy: 0.533
By difficulty tier:
easy acc=0.333
medium acc=0.273
hard acc=1.000
By class (precision / recall / F1):
benign 0.000 / 0.000 / 0.000
Look at that tier pattern. The model scored 0.333 on easy variants and 1.000 on hard ones. That is inverted. A capable model should do best on the easy cases. And benign recall was a flat zero, it got every benign variant wrong.
My first instinct was "the model is bad at this." My second, better instinct was "a frontier model is not genuinely worse at easy variants than hard ones, so the bug is in my benchmark, not the model." So before changing anything, I dumped the per-variant predictions next to the oracle.
The diagnosis: it was abstaining, not failing
The dump made it obvious. Every single mismatch was the model answering "uncertain" where ClinVar had a confident call:
BRCA2 easy oracle: PATHOGENIC model: VUS
GAMT easy oracle: BENIGN model: VUS
ITGB3 easy oracle: PATHOGENIC model: VUS
It never confused pathogenic with benign. Not once. It abstained to "uncertain" whenever it was not sure. And the "hard" tier scored 1.000 because hard-tier variants are mostly genuine VUS, so its abstention happened to match the oracle there.
Then I read the actual reasoning text, and it was textbook clinical genetics:
"This is a missense variant in ITGB3... no population frequency, functional, segregation, or computational evidence was provided to establish pathogenicity."
The model was not failing. It was correctly recognising that, under ACMG interpretation guidelines, a missense variant genuinely cannot be classified without evidence it had not been given. Meanwhile, on loss-of-function variants where the consequence alone meets a strong ACMG criterion (PVS1), it confidently and correctly called pathogenic:
"The variant p.Trp150Ter introduces a premature stop codon early in NPHS1, predicting loss of function via nonsense-mediated decay..."
This is the behaviour you want from a model in a clinical loop. It defers when it should defer and commits when it should commit. And a plain accuracy score had branded it a failure.
The fix: score abstention separately from error
The problem was conceptual, not a code bug. Accuracy treats every non-match as equally bad. But in a clinical setting, two kinds of "wrong" are worlds apart:
- A safe abstention: the model says "uncertain" when the truth was a confident call. Not ideal, but safe. A clinician knows to investigate.
- A confident error: the model makes the opposite confident call (benign when pathogenic, or vice versa). This is the dangerous failure.
So I added an abstention-analysis layer that separates these on confident-truth variants:
for v, r in confident_truth:
if r.classification == v.oracle_classification:
correct_calls += 1
elif r.classification in _CONFIDENT: # opposite confident call
confident_errors += 1 # the dangerous bucket
else: # returned VUS
abstentions += 1 # safe
safe_rate = 1.0 - (confident_errors / n_ct)
decisiveness = (correct_calls + confident_errors) / n_ct
The safe_rate is the metric that actually matters: how often does the model avoid a confident wrong call. The decisiveness tells you how often it commits at all. A model can be perfectly safe and barely decisive, and accuracy alone makes that invisible.
The experiment: does giving the model evidence help?
If the model abstains because it lacks evidence, what happens when you give it some? I added an evidence-rich mode that supplies the real molecular consequence, derived directly from the variant's own HGVS (frameshift, nonsense, splice, missense, synonymous), plus a note on its ACMG relevance.
A critical design rule here: the evidence had to be real. The tempting move is to inject plausible-looking allele frequencies or functional results to make the task answerable. But fabricating evidence in an evaluation is exactly the failure mode I was trying to detect in the model. So the evidence-rich mode supplies only what can be honestly derived from the variant's own nomenclature, which happens to be the single highest-weighted ACMG criterion. No external data, nothing invented, fully traceable to source.
I ran both modes on 100 variants. Here is the full result.
| Metric | Evidence-poor | Evidence-rich |
|---|---|---|
| Confident errors | 0 | 0 |
| Safe rate | 1.000 | 1.000 |
| Accuracy | 0.600 | 0.640 |
| Cohen kappa | 0.362 | 0.429 |
| Decisiveness | 0.371 | 0.419 |
| Macro F1 | 0.476 | 0.570 |
The headline: zero confident errors across 200 interpretations. The model never made a dangerous wrong call in either mode. Adding the real consequence evidence improved accuracy and decisiveness without introducing a single error. Evidence made it commit correctly, not recklessly.
A small-sample trap worth flagging
At n=30, the evidence-rich mode scored slightly lower than evidence-poor. If I had written it up then, I would have reported "evidence makes the model worse," which is the opposite of the truth. At n=100 the effect reversed and evidence-rich was clearly better. The same calibration discipline a benchmark demands of the model applies to the person running it. Small samples lie. I treated the n=30 result as a signal to investigate, not a finding to publish.
Checking the reasoning, not just the label
A model can be right for the wrong reason, or fabricate evidence. So beyond the label, four validators run independently of the oracle:
- Gene grounding: did it name the gene actually carrying the variant?
- Consequence consistency: does its stated consequence match the variant's real one?
- Mechanism plausibility: for a LoF variant called pathogenic, does the reasoning invoke loss of function?
- No-fabrication: did it invent allele frequencies, citations, patient counts, or studies it was never given?
_FABRICATION_PATTERNS = [
r"\bgnomad\b",
r"\ballele frequency of\s*[\d.]+",
r"\b\d+\s*(?:patients|families|individuals|cases)\b",
r"\bet al\.?\b",
r"\b(?:19|20)\d{2}\b", # a citation year
r"\bfunctional (?:study|studies|assay) (?:showed|demonstrated|confirmed)",
]
Results: gene grounding 1.000, no-fabrication 1.000 in the poor mode. In the rich mode no-fabrication dipped to 0.970, three variants out of a hundred where the richer prompt nudged the model toward citing evidence types. I report that dip rather than hide it. It is a real and minor finding, and it is exactly what the validator exists to catch.
The takeaway for anyone building evals in a hard domain
The lesson is not about genetics. It is about evaluation design.
A naive accuracy metric on this task would have told me a careful, safe, well-calibrated model was mediocre. The model was not mediocre. My metric was. The gap between the model and the oracle was not pure error, it was a precise measure of the evidence the model was never given, and the model abstained exactly where that evidence mattered.
In high-stakes domains, your benchmark has to be at least as sophisticated as the model it judges. Specifically:
- Separate safe failures from dangerous ones. "Wrong" is not one thing. An honest abstention and a confident error should never share a bucket.
- Audit the reasoning, not just the answer. Right-for-the-wrong-reason and fabrication are invisible to label accuracy.
- Keep your injected evidence real. If your eval fabricates inputs, it cannot credibly test the model for fabrication.
- Calibrate before you conclude. Small samples reverse. Hold your own analysis to the standard you hold the model.
The code is on GitHub: gbadedata/clinvar-interpretation-benchmark. 91 tests, runs offline against a mock with no API key, CI green. The live path is one flag.
If you are building evaluation frameworks for models in medicine, law, finance, or any domain where a confident wrong answer is worse than an honest "I don't know," I would genuinely like to hear how you are drawing that line. That distinction, I am increasingly convinced, is where the real work of evaluation lives.
References
All verified against the primary source.
- Landrum MJ et al. (2018). ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Research 46(D1):D1062-D1067. doi:10.1093/nar/gkx1153
- Richards S et al. (2015). Standards and guidelines for the interpretation of sequence variants (ACMG/AMP). Genetics in Medicine 17(5):405-424. doi:10.1038/gim.2015.30
- Tavtigian SV et al. (2018). Modeling the ACMG/AMP variant classification guidelines as a Bayesian classification framework. Genetics in Medicine 20(9):1054-1060. doi:10.1038/gim.2017.210
- Karczewski KJ et al. (2020). The mutational constraint spectrum quantified from variation in 141,456 humans (gnomAD). Nature 581(7809):434-443. doi:10.1038/s41586-020-2308-7
Top comments (0)