
Arthur

Posted on • Originally published at pickles.news

Claude Found Eleven Medical Errors in One Family's Records

A software engineer with a child in a long-running diagnostic process built a homegrown medical-record service for his family, dumped years of accumulated records into Claude Opus, and asked the model to look for missed diagnoses, contraindications, and dosing errors. The model came back with eleven concrete items. Some were trivial: drug interactions an EMR should already have flagged. Some were meaningful: a missing routine test that would have changed a treatment plan, a mislabelled prescription that nobody had reconciled across three different specialists. All of them were verifiable against the records.

I want to take this experiment seriously. Not as a verdict on whether AI is going to replace doctors. As something narrower: an indication of which specific kind of medical work an LLM can already do better than the system around it, and why.

What was actually done

The setup was a personal project, not a clinical product. The engineer wrote a small Node service backed by SQLite, built a family-level data model with explicit linking tables connecting prescriptions to diagnoses to providers to appointments, and pulled in years of records — outpatient visits, lab results, imaging reports, vaccination history, growth charts. The data model is the unfashionable kind: thirteen medical tables with linking rows, a few hundred relationships, no graph database, no vector store, just JOINs.
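For concreteness, here is a minimal sketch of what that kind of schema looks like in Node with better-sqlite3. The table and column names are my assumptions for illustration, not the engineer's actual schema; the point is the plain linking table, not the particulars.

// schema.js: sketch of the linking-table idea (illustrative names, not the real schema)
const Database = require("better-sqlite3");
const db = new Database("family-records.db");

db.exec(`
  CREATE TABLE IF NOT EXISTS patients    (id TEXT PRIMARY KEY, dob TEXT, sex TEXT);
  CREATE TABLE IF NOT EXISTS providers   (id TEXT PRIMARY KEY, specialty TEXT);
  CREATE TABLE IF NOT EXISTS diagnoses   (id TEXT PRIMARY KEY, patient_id TEXT REFERENCES patients(id),
                                          icd10 TEXT, label TEXT, entered_by TEXT REFERENCES providers(id),
                                          entered_at TEXT, status TEXT);
  CREATE TABLE IF NOT EXISTS medications (id TEXT PRIMARY KEY, patient_id TEXT REFERENCES patients(id),
                                          drug TEXT, dose TEXT, prescribed_by TEXT REFERENCES providers(id),
                                          prescribed_at TEXT, rationale TEXT,
                                          discontinued_at TEXT, discontinuation_reason TEXT);

  -- the unfashionable part: a plain linking table instead of a graph edge
  CREATE TABLE IF NOT EXISTS diagnosis_medications (
    diagnosis_id  TEXT NOT NULL REFERENCES diagnoses(id),
    medication_id TEXT NOT NULL REFERENCES medications(id),
    PRIMARY KEY (diagnosis_id, medication_id)
  );
`);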

The Claude integration is the part that matters. There is no API call from the application code. Instead the service exposes a single endpoint, GET /api/patient-context, that returns the full state of the database as one JSON document — diagnoses with the providers who entered them, medications with prescribing rationale and discontinuation reasons, all lab results with reference ranges, every appointment with its associated documents. The blob is roughly 200–500 KB per patient, which Anthropic's published context limits handle without breaking a sweat. The user opens Claude Code, points the model at a long instruction file describing the analysis protocol, and the model fetches the full patient context in a single request, processes the new data, and writes results back through the same API.

The shape of the JSON the model ingests is not exotic. Roughly:

// GET /api/patient-context — full state of one family member, ~200–500 KB
{
  "patient": { "id": "p_kid_2", "dob": "2015-04-12", "sex": "F" },
  "diagnoses": [
    {
      "id": "dx_18", "icd10": "K59.00", "label": "Constipation, unspecified",
      "entered_by": "prov_07", "entered_at": "2024-09-14", "status": "active",
      "linked_medications": ["rx_42"], "linked_appointments": ["appt_113"]
    }
  ],
  "medications": [
    {
      "id": "rx_42", "drug": "PEG 3350", "dose": "8.5 g/day",
      "prescribed_by": "prov_07", "prescribed_at": "2024-09-14",
      "rationale": "dx_18; first-line per AAP 2023",
      "discontinued_at": null, "discontinuation_reason": null
    }
  ],
  "labs": [
    {
      "id": "lab_91", "panel": "CBC", "drawn_at": "2025-02-03",
      "values": [
        { "analyte": "Hemoglobin", "value": 11.2, "unit": "g/dL",
          "ref_low": 11.5, "ref_high": 15.5, "flag": "low" }
      ],
      "ordering_provider": "prov_07"
    }
  ],
  "appointments": [
    { "id": "appt_113", "date": "2024-09-14", "provider": "prov_07",
      "specialty": "pediatrics", "documents": ["doc_pdf_204"],
      "linked_diagnoses": ["dx_18"] }
  ],
  "providers": [
    { "id": "prov_07", "specialty": "pediatrics" }
  ]
}

The interesting feature is the linking. Diagnoses point at medications point at appointments, so the model can reconstruct that this prescription exists because of that diagnosis recorded at that visit, without having to guess. A flat dump of the same tables would not give it that. A graph database would, but a graph database isn't necessary; foreign keys and a serializer that follows them are enough.
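That serializer is the whole trick, and it is not much code. A hedged sketch in Express, written against the illustrative schema above rather than the author's real one, might look like this:

// server.js: sketch of the GET /api/patient-context serializer (illustrative, not the author's code)
const express = require("express");
const Database = require("better-sqlite3");

const db = new Database("family-records.db", { readonly: true });
const app = express();

app.get("/api/patient-context", (req, res) => {
  const patientId = req.query.patient_id;
  const patient = db.prepare("SELECT * FROM patients WHERE id = ?").get(patientId);
  if (!patient) return res.status(404).json({ error: "unknown patient" });

  // follow the foreign keys so every diagnosis carries the medications it justified
  const diagnoses = db.prepare("SELECT * FROM diagnoses WHERE patient_id = ?").all(patientId)
    .map((dx) => ({
      ...dx,
      linked_medications: db
        .prepare("SELECT medication_id FROM diagnosis_medications WHERE diagnosis_id = ?")
        .all(dx.id)
        .map((row) => row.medication_id),
    }));

  res.json({
    patient,
    diagnoses,
    medications: db.prepare("SELECT * FROM medications WHERE patient_id = ?").all(patientId),
    providers: db.prepare("SELECT * FROM providers").all(),
    // labs and appointments are assembled the same way: query, follow the links, attach
  });
});

app.listen(3000);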

By the engineer's own math, a full review of one family member's entire history runs roughly $0.60 at current Anthropic API rates (about $5/M input tokens and $25/M output for Opus), with new documents arriving in his household two to four times a month. He's on a Claude Max subscription anyway, so the marginal cost is essentially zero.
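The arithmetic holds up as an order-of-magnitude estimate. A back-of-the-envelope version, with an assumed bytes-per-token ratio and an assumed output length, lands in the same range:

// cost-estimate.js: rough check on the ~$0.60 figure (heuristics, not a billing calculation)
const contextBytes = 400 * 1024;        // mid-range of the 200-500 KB patient-context blob
const inputTokens  = contextBytes / 4;  // crude assumption: ~4 bytes per token for JSON-ish text
const outputTokens = 4000;              // assumed length of the written-back analysis

const cost = (inputTokens / 1e6) * 5 + (outputTokens / 1e6) * 25;  // Opus rates quoted above
console.log(`$${cost.toFixed(2)} per full review`);                // prints $0.61, the same ballpark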

What the benchmarks tell us, and what they don't

The headline numbers on LLM medical performance look good. GPT-4 scored about 71.6% on MedQA, the standard medical-licensing-exam benchmark, in a zero-shot setting, and the most recent generation clears it comfortably: OpenAI's GPT-5, released in 2025, scored 95.84% on the same benchmark, and o1 Preview hit 95.00% in early-2025 evaluations. On paper the models are scoring at or above human passing thresholds.

The headline numbers also tell you almost nothing about what happens in a clinic. Diagnostic accuracy drops sharply when LLMs gather information through conversation rather than from a static case description — GPT-4's accuracy fell from 82% to 62.7% in one such evaluation. The benchmarks measure exam performance on curated cases. Real medicine is messier than that, and the models know less than the headline suggests when they have to ask the right questions in the right order to pull the relevant information out of a patient.

That's the failure mode the eleven-error experiment doesn't run into. The engineer did not ask the model to take a history. He gave it the history. All of it. In one prompt. The model's task wasn't differential-diagnosis-from-symptoms; it was cross-record review against existing data. Those are different problems, and the second one is where the model's structural advantage actually lives.

The advantage isn't the model. It's the context.

A primary-care provider in the United States gets, on a good day, fifteen minutes per appointment. The electronic health record they're working with is, almost universally, terrible at surfacing longitudinal context (it was built mostly for billing, with the clinical picture grafted on). The provider's working memory of one patient's history has to compete with the working memory of twenty other patients on the same day, plus the memory of every clinical guideline and drug interaction and reference range they ever learned. They are good at this. They are not magicians.

A 2014 study by Singh, Meyer, and Thomas, published in BMJ Quality & Safety, estimated that about 5% of US adult outpatients experience a diagnostic error each year — a missed or delayed correct diagnosis — and that more than half of those errors carry potential for harm. Hardeep Singh and Mark Graber have spent the last decade-plus building out the literature on where the errors come from. The recurring finding is that the errors are not stupid. They are systemic. They happen at the seams between specialists, in the gaps between the EMR's view and the patient's full record, in the moments when the provider would have seen the issue if they had been looking at everything at once.

An LLM with the full record in its context is, for that one task, looking at everything at once. It is not a better diagnostician than the doctor. It is the same level of diagnostician with a longer attention span and no other patients in the waiting room. That is an unfair advantage. It is also not one the healthcare system is set up to give to a human provider any time soon.

What the experiment surfaced, in categories

Without naming the family or the specific clinical findings, the eleven flagged items broke down roughly like this. There were a small number of missed routine tests — standard panels that should have been ordered for the documented diagnosis at the documented intervals, and weren't. There were cross-specialist conflicts: three different specialists in the same field had each prescribed their own regimen over an eighteen-month window, and only one of the three had asked what the child was already taking. There was a guideline-versus-recommendation mismatch, where one provider had recommended a treatment modality (in this case osteopathy, which mainstream evidence-based reviews classify as outside the bounds of evidence-based medicine for most claimed indications) and the model surfaced the gap without editorializing about the provider. And there were small administrative misses — vaccination intervals, growth-chart percentile drift, expired lab references.

Side-by-side, the same five categories look like this:

| Category | What the EMR view typically shows | What Claude surfaced with the full record in context |
| --- | --- | --- |
| Missed routine tests | One ordered panel per visit; intervals across visits aren't computed for you | A standard panel that the documented diagnosis calls for at six-month intervals, last ordered eighteen months ago |
| Cross-specialist conflicts | One specialist's regimen at a time; reconciliation depends on patient or family memory | Three independent regimens for the same condition prescribed by three different specialists; only one had asked about existing medications |
| Guideline-versus-recommendation mismatch | The recommendation as written; no automatic comparison to evidence-based summaries | The recommendation alongside the relevant guideline language, with the gap named, no editorial about the provider |
| Drug-interaction trivia | Single-prescriber view; cross-prescriber and supplement interactions surface only at the pharmacy, if at all | Whole-record interaction scan, including herbals and supplements logged in visit notes |
| Administrative drift | One vaccine row, one growth-chart point, one lab reference at a time | Vaccine intervals, growth-percentile drift, and stale reference-range comparisons computed across years |

The model did not say any of these were dangerous. It said, in each case, here is what was done, here is what the standard guidelines call for, here is the gap. The standard primary-care work product, generated in one pass.

The model also got something wrong

This is worth dwelling on. At one point Claude misread a PDF lab report — a value from one row of a tightly spaced table got attributed to the wrong analyte, and the model dutifully flagged a "critical deviation" that turned out not to exist. The engineer caught it on review, asked for a re-check, and the model corrected the analysis. The fix in the long-term protocol was a two-pass verification step: re-render PDFs at 400-500 DPI, then cross-check every numeric value line by line against the original. Since that change, the same kind of error has not recurred.

This is the right order of operations. The model did real work. The model also produced a confidently-wrong line of analysis. A human had to verify before anything went into the patient's record. None of which negates the eleven concrete corrections; all of which means the model's outputs need a verification layer before they are treated as findings.
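What that verification layer looks like in practice is mundane. Here is a sketch of the two-pass idea, using poppler's pdftoppm for the high-DPI re-render and a plain value-by-value comparison; the helper names and the shape of the extracted values are my assumptions, not the author's protocol code.

// verify-lab-values.js: sketch of the two-pass check (illustrative)
const { execFileSync } = require("node:child_process");

// Pass 1: re-render the PDF at 400 DPI so tightly spaced table rows are unambiguous
// (pdftoppm is part of poppler-utils; -r sets resolution, -png writes one image per page)
function rerenderPdf(pdfPath, outPrefix, dpi = 400) {
  execFileSync("pdftoppm", ["-r", String(dpi), "-png", pdfPath, outPrefix]);
}

// Pass 2: compare every value from the first extraction against a second, independent read.
// Both arrays look like [{ analyte: "Hemoglobin", value: 11.2 }, ...]
function crossCheck(firstPass, secondPass) {
  const reference = new Map(secondPass.map((v) => [v.analyte, v.value]));
  return firstPass
    .filter((v) => reference.get(v.analyte) !== v.value)
    .map((v) => ({ analyte: v.analyte, first: v.value, second: reference.get(v.analyte) ?? null }));
}

// Any mismatch sends the flag back for a re-check instead of into the record.
module.exports = { rerenderPdf, crossCheck };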

Why this is awkward

The experiment is replicable. Anyone with access to their own records (which in much of the United States is now legally required to be available through a patient portal) can do roughly this. The technical bar is low: a few hundred lines of code, some standard tooling, an LLM subscription. The clinical bar is high: the human in the loop has to be careful, has to know what an LLM is and is not good at, has to verify each flag before treating it as actionable.

Eric Topol has been arguing for years, most recently in his late-2024 talks at NIH and RSNA, that AI in clinical settings should be augmentation rather than replacement — the model handles the cross-record reasoning the human can't physically do, the human handles the judgment the model can't reliably make. The experiment is one consumer-grade implementation of exactly that vision, except it routes around the clinic. The patient's family ran the model on the patient's records. The clinic was not in the loop.

Whether that's a good thing depends on which question you're asking. If the question is "should patients be doing their own LLM-based cross-record review," the honest answer is that they're going to anyway, and the open question is whether they do it with good tools or bad ones. If the question is "should the healthcare system be building this kind of review into the workflow," the answer is yes, but the system is not doing it on a useful timescale. The eleven errors did not surface because the model is smarter than the doctor. They surfaced because nobody else in the chain was being given the time and the context to look.

What the experiment really showed

The eleven errors are not a verdict on AI. They are a measurement of how much routine cross-record review primary care isn't delivering to patients with complicated histories.

The model's contribution was not insight. It was attention: specifically the kind of attention you can buy in a single API call but cannot buy in a fifteen-minute appointment. The records were already there. The errors were already in the records. Anyone with the time to read everything would have found them. Until very recently, "anyone with the time to read everything" was a category that did not include any actor in the patient's care. Now it includes a $0.60 prompt.

The interesting follow-up question isn't whether AI can replace doctors. It's what happens when patients with chronic complicated histories routinely show up with the eleven items already flagged, and the doctor has fifteen minutes to address them. The clinic doesn't get a vote on whether that future arrives. It is arriving, on the patient's laptop, today.
