Originally published at prometheno.org.
Now let's think together. In The clinical-truth gap
I said clinical-truth verification belongs in medical empiricism. A
fair objection: why bother with real-patient infrastructure at all when
synthetic data exists? Synthea, MDClone, Syntegra, mostly.ai — generate
fake patients with the statistical properties of real ones, train
models on those, ship.
The honest answer is to look at how synthetic data actually performs,
not how it's pitched.
What synthetic does well
Three uses where it earns its place.
Pipeline testing. No PHI, no HIPAA review, no consent overhead.
Engineers stress-test ingestion code, validate FHIR mappings, exercise
edge cases. Synthea, the MITRE-developed open-source generator, was
built explicitly for this[1] and is widely used across US health-IT projects.
Training augmentation. For rare conditions where real samples are too
scarce to train on, synthetic supplementation lifts model performance
measurably. A recent study on rare thyroid cancer subtype
classification used text-guided diffusion to generate synthetic images
and improved subtype-classification AUC from 0.7364 to 0.8442[2]. The
gain came from hybrid training: synthetic plus real beat real alone.
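The data-assembly step behind that hybrid result is simple to sketch. This is an illustrative Python sketch, not any study's actual pipeline; the function name, the `(features, label)` record format, and the counts are assumptions. The point is the shape of the recipe: keep every real record, and top up only the rare class from the synthetic pool.

```python
# Hedged sketch: hybrid-training data assembly. Names and record format
# are illustrative, not taken from the cited study.

def augment_rare_class(real, synthetic, rare_label, target_count):
    """Keep all real records; add synthetic records of the rare class
    until that class reaches target_count. Real data stays primary,
    synthetic only fills the gap."""
    combined = list(real)
    rare_have = sum(1 for _, y in real if y == rare_label)
    for x, y in synthetic:
        if rare_have >= target_count:
            break
        if y == rare_label:
            combined.append((x, y))
            rare_have += 1
    return combined

# Toy cohort: 3 real rare cases, 50 common; synthetic pool fills rare to 10.
real = [([0.1], "rare")] * 3 + [([0.9], "common")] * 50
synthetic = [([0.15], "rare")] * 20
hybrid = augment_rare_class(real, synthetic, "rare", target_count=10)
```

A model trained on `hybrid` sees the same real distribution plus a less starved rare class, which is the mechanism the AUC gain above relies on.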
Aggregate statistical research. Questions like "what's the average
HbA1c trajectory?" or "what's the comorbidity prevalence?" often produce
similar answers on synthetic and real data, with no individual-level
exposure. A JMIR comparison study of five MDClone-generated cohorts
against their real counterparts found the analyses "provide a close
estimate of real data results in general," with caveats depending on
the patient-to-variable ratio[3].
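The kind of cohort-level agreement check this describes can be sketched in a few lines. The variable, tolerance, and values below are illustrative assumptions, not figures from the JMIR study; the structural point is that only aggregates are compared, never individual records.

```python
# Hedged sketch of an aggregate-level agreement check between a real and
# a synthetic cohort. Tolerance and values are illustrative.
from statistics import mean

def aggregate_agreement(real_values, synth_values, rel_tol=0.05):
    """Compare one cohort-level statistic (here, the mean) and report
    whether the synthetic estimate falls within a relative tolerance."""
    r, s = mean(real_values), mean(synth_values)
    return r, s, abs(r - s) <= rel_tol * abs(r)

real_hba1c  = [6.8, 7.1, 7.4, 6.9, 7.2]   # real cohort (toy values)
synth_hba1c = [6.9, 7.0, 7.3, 7.1, 7.2]   # synthetic cohort (toy values)
r, s, ok = aggregate_agreement(real_hba1c, synth_hba1c)
```

A real validation run would repeat this per variable and tighten the tolerance as the patient-to-variable ratio drops, which is exactly the caveat the study flags.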
That's a real value proposition. The series doesn't dismiss it.
What the benchmarks show
Three places the numbers cut against synthetic-as-substitute.
Rare-event performance plateaus. SHEPHERD — a Harvard/Zitnik-lab
model trained on 40,000+ synthetic patients across 2,134 rare diseases
— achieved 40% top-1 accuracy in causal gene discovery when
evaluated against the real-world Undiagnosed Diseases Network
cohort[4]. Forty percent is useful as a triage tool. It is not
clinical-grade. The gap between synthetic-trained performance and
real-world ground truth is precisely the gap synthetic data can't close
on its own.
[Figure 1: a two-column decision matrix. Left column, "Synthetic suffices": pipeline testing, training augmentation, aggregate research, hypothesis generation. Right column, "Real data required": regulatory submission, outcome verification, AI accountability, rare-event prediction. Caption: synthetic data does real work in the left column; the right column is what HAVEN's real-patient infrastructure exists to serve.]
Hybrid almost always wins, and hybrid needs real data. Across
healthcare AI benchmarks, models trained on synthetic + real outperform
either alone. An AMD fundus-image study using ResNet-18 reached 85%
accuracy when combined data was used — outperforming the same
architecture trained on synthetic-only by a clinically meaningful
margin[5]. The destination is rarely synthetic. It's augmentation.
Privacy isn't as clean as advertised. Membership-inference attacks
against synthetic health data work. A 2022 JMIR analysis (since
extended by multiple 2024 papers) demonstrated attackers can infer with
substantial confidence whether a specific real patient's record was
used to generate a synthetic cohort[6]. The re-identification risk
rises for unique cases — older patients, rare conditions — which is
exactly the population synthetic data is most often used for.
Differential privacy mitigates this, but only at meaningful utility
cost.
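The attack family cited above is easy to convey in miniature. This is a hedged sketch of a distance-based membership-inference test in the spirit of those papers, not a reproduction of any specific method; the feature layout, threshold, and values are assumptions. The intuition: a rare, unique record sits far from the bulk of the distribution, so a synthetic record landing almost on top of it is strong evidence the generator memorized it.

```python
# Hedged sketch of a distance-based membership-inference attack.
# Threshold and feature layout are illustrative assumptions.

def nearest_distance(candidate, synthetic_cohort):
    """Euclidean distance from a candidate record to its closest
    synthetic record."""
    return min(
        sum((a - b) ** 2 for a, b in zip(candidate, s)) ** 0.5
        for s in synthetic_cohort
    )

def infer_membership(candidate, synthetic_cohort, threshold):
    """True = attacker guesses the candidate was in the generator's
    training set, because a synthetic record is suspiciously close."""
    return nearest_distance(candidate, synthetic_cohort) < threshold

# Toy (HbA1c-like, age-like) records. The outlier was leaked near-verbatim.
synthetic   = [(7.0, 62.0), (7.1, 60.0), (9.8, 91.0)]
rare_member = (9.8, 91.0)   # unique real patient the generator copied
non_member  = (5.0, 30.0)   # real patient never seen by the generator

leaked   = infer_membership(rare_member, synthetic, threshold=0.5)
outsider = infer_membership(non_member, synthetic, threshold=0.5)
```

Note how the attack is most confident precisely on the unique, rare cases, which matches the risk profile described above.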
Where synthetic structurally can't go
Two categorical limits. Better generators don't fix them.
Real outcomes. A synthetic patient doesn't develop sepsis. Doesn't
survive their cancer. Doesn't die from heart failure five years later.
Synthetic outcome data is fictional — produced to match training
distributions, not real biology. For Prometheno's longer-term horizon
— paying or penalizing AI vendors when their predictions match or miss
reality — the outcome side cannot be synthetic. No algorithm turns
simulation into observation.
Regulatory ground truth. The FDA's Center for Devices and Radiological
Health issued updated real-world-evidence guidance in December 2025[7].
The framework rests on
observational data from actual patients in actual care. Synthetic
control arms have a defined pathway as supplements to real evidence,
not substitutes for it. The EMA's position is similar. For any AI/ML medical
device seeking clearance, the path runs through real data.
What HAVEN does that synthetic can't
The strongest argument for HAVEN comes from accepting synthetic
data's strengths.
If synthetic-only training plateaus well below clinical-grade for rare
events, the path forward runs through hybrid models — and hybrid needs
governed real data. Consent, audit, and quality grading are what make
hybrid defensible at population scale.
If real outcomes can't be synthesized, AI accountability runs on real
outcome data. The infrastructure for collecting outcomes, tying them
to the predictions that preceded them, and attributing value back to
the contributing patients is what HAVEN's four primitives enable.
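To make the prediction-to-outcome loop concrete, here is a minimal settlement sketch. Prometheno's actual pay/penalize mechanism and HAVEN's primitives are not specified in this post, so the scoring rule, field names, and amounts below are all illustrative assumptions. The structural point survives any scoring rule: nothing settles until a real observed outcome exists for the patient.

```python
# Hedged sketch of prediction-outcome settlement. The scoring rule and
# field names are illustrative assumptions, not Prometheno's mechanism.

def settle(predictions, outcomes, reward=1.0, penalty=1.0):
    """Score each vendor prediction against the real observed outcome
    and return net value per vendor. Predictions with no real outcome
    yet cannot settle: simulation can't substitute for observation."""
    net = {}
    for p in predictions:
        observed = outcomes.get(p["patient"])
        if observed is None:
            continue  # no real outcome recorded yet; nothing to settle
        delta = reward if p["predicted"] == observed else -penalty
        net[p["vendor"]] = net.get(p["vendor"], 0.0) + delta
    return net

predictions = [
    {"vendor": "v1", "patient": "pt-1", "predicted": "sepsis"},
    {"vendor": "v1", "patient": "pt-2", "predicted": "no-sepsis"},
    {"vendor": "v2", "patient": "pt-1", "predicted": "no-sepsis"},
]
outcomes = {"pt-1": "sepsis", "pt-2": "no-sepsis"}  # real observations
net = settle(predictions, outcomes)
```

Attribution back to contributing patients would hang off the same ledger: each settled prediction names the patient whose real outcome made it scoreable.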
If membership-inference attacks compromise synthetic privacy claims,
the answer isn't to abandon real data. It's to govern access to real
data properly. Consent-attestation and hash-chained audit produce
traceability that de-identification alone never did.
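The hash-chained audit idea is worth one small sketch. HAVEN's actual audit format is not specified here, so the field names and chaining scheme below are illustrative; the property being demonstrated is general: each entry commits to the hash of the previous one, so any retroactive edit invalidates every later hash.

```python
# Hedged sketch of a hash-chained audit log. Field names and the exact
# chaining scheme are illustrative, not HAVEN's specification.
import hashlib
import json

GENESIS = "0" * 64  # sentinel "previous hash" for the first entry

def append_entry(chain, event):
    """Append an event whose hash commits to the previous entry's hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    payload = json.dumps({"prev": prev, "event": event}, sort_keys=True)
    chain.append({
        "prev": prev,
        "event": event,
        "hash": hashlib.sha256(payload.encode()).hexdigest(),
    })
    return chain

def verify(chain):
    """Recompute every hash; any tampered entry breaks the chain."""
    prev = GENESIS
    for e in chain:
        payload = json.dumps({"prev": prev, "event": e["event"]},
                             sort_keys=True)
        if e["prev"] != prev:
            return False
        if e["hash"] != hashlib.sha256(payload.encode()).hexdigest():
            return False
        prev = e["hash"]
    return True

log = []
append_entry(log, {"actor": "vendor-a", "action": "read", "record": "pt-001"})
append_entry(log, {"actor": "vendor-a", "action": "read", "record": "pt-002"})
```

Editing any past `event` and re-running `verify(log)` returns False, which is the traceability property de-identification alone never provided.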
Synthetic data is complementary. It strengthens the case for a
patient-sovereign protocol layer rather than replacing it.
What comes next
The next post commits to what would prove the whole argument wrong.
1. Walonoski, J., et al. "Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record." Journal of the American Medical Informatics Association 25, no. 3 (2018): 230-238. Open-source, MITRE-maintained, used widely for testing and demonstration. ↩
2. Frontiers in Digital Health, "Synthetic data generation: a privacy-preserving approach to accelerate rare disease research" (2025). Text-guided diffusion produced synthetic images with a 92.2% realism rate; hybrid training improved AUC from 0.7364 to 0.8442. ↩
3. JMIR Medical Informatics 8, no. 2 (2020), "Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies." https://medinform.jmir.org/2020/2/e16492/ ↩
4. Alsentzer, E., et al. "Deep learning for diagnosing patients with rare genetic diseases." Zitnik Lab, Harvard. SHEPHERD model evaluated against the NIH Undiagnosed Diseases Network real-world cohort; published results show 40% top-1 accuracy on causal gene discovery. ↩
5. npj Digital Medicine, "Generating high-fidelity synthetic patient data for assessing machine learning healthcare software" (2020). https://www.nature.com/articles/s41746-020-00353-9. ResNet-18 on AMD fundus images: 85% accuracy with combined real+synthetic data. ↩
6. Hyeong, J., et al. "Membership inference attacks against synthetic health data." Journal of Biomedical Informatics 125 (2022). https://pmc.ncbi.nlm.nih.gov/articles/PMC8766950/. Extended by multiple 2024 papers, including work on differentially private synthetic data and re-identification of tabular GANs. ↩
7. U.S. Food and Drug Administration. Use of Real-World Evidence to Support Regulatory Decision-Making for Medical Devices. Final guidance, December 16, 2025 (supersedes 2017 guidance). Real-World Data quality criteria emphasize relevance and reliability of observational data from actual patients in actual care. ↩