When AI safety training withholds what could help you

#aisafety #healthcare #evaluation #alignment

A new study, IatroBench, provides evidence that heavy safety training in AI models can cause harm by withholding accurate medical information from patients — even as the same models provide that information freely to physicians asking identical clinical questions. The researchers call this "iatrogenic omission harm": injury caused not by what the AI gets wrong, but by what it leaves out.

Key facts

What: A pre-registered study finds heavily safety-trained models give doctors medical information they refuse to give ordinary people, with identical facts.
When: 2026-06-25
Primary source: read the source (arXiv 2604.07709)

The study, posted on arXiv, was pre-registered: the researchers committed to their methods and success criteria before running it, guarding against fishing for a conclusion. They wrote dozens of medical scenarios and posed each to several leading AI models, keeping the medical facts identical but changing who was asking. Sometimes the question came from a physician; sometimes from an ordinary patient. The clinical content was the same — only the apparent identity of the asker changed.

The models give doctors more than they give patients, even though the underlying facts are identical. The same model that walks a physician through a situation will hedge, soften, or refuse when an ordinary person asks the same thing. A patient who is refused accurate, relevant information can be hurt by that silence just as surely as by a mistake.

Three details sharpen the picture. First, the gap was widest in the most heavily safety-trained model in the study, indicating this is a side effect of the safety training itself — the more polished the caution, the wider the gap. Second, the trigger isn't credentials. You don't need to prove you're a doctor; you just need to sound knowledgeable. An informed layperson, or someone framing the question like a professional, can often recover what a worried-sounding "patient" is refused, which means the model is keying off tone, not genuine need. Third, and most damning for how the industry evaluates itself, when the researchers asked a standard automated judge — an AI grading other AIs — to flag this withholding as harmful, it almost entirely failed to see it. Our explainer on using AI to grade AI is relevant here, because it's exactly that common shortcut that proved blind to this problem.

This is a contrarian result in a field where "more safety" is the default applause line. It sits in sharp tension with the same week's work on building stronger AI safety controls, and together they map the real shape of the problem: safety isn't a dial you simply turn up. Optimizing a model to refuse can transfer harm onto the least-expert users — the ones who can't reframe their question to get past the filter — and current evaluation tools can be blind to it happening.

The authors offer an important caveat: the scenarios were deliberately engineered to create collisions between safety and helpfulness, so the rates they report describe the test's design, not how often this happens in everyday use. This is not evidence that medical AI is broadly harmful. It is evidence of a specific, real failure mode that standard testing misses — and a case that "safe" has to mean safe for the person actually asking, not just safe for the company's liability.

Originally published on Ground Truth, where every claim is checked against the primary source.

DEV Community

When AI safety training withholds what could help you

Key facts

Top comments (0)