DEV Community

Satish Singh

We scanned 3,000 healthcare repositories. Here's what we found in CDC, VA, NHS, and Google's code.

Every year, healthcare organizations spend billions on compliance.
Auditors review policies. Security teams run vulnerability scans.
Certifications get renewed. And yet the actual code running on
production healthcare systems, the code that handles your medical
records, your Social Security Number, your vaccination history,
remains largely unexamined.

We decided to examine it.

Over the past several months, we built a static analysis engine
that reads healthcare source code the way a compliance auditor
would, mapping code patterns directly to specific HIPAA sections,
GDPR articles, SOC 2 criteria, and India's DPDPA requirements.
Then we pointed it at 3,000 public healthcare repositories spanning
9 programming languages and 4 continents.
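To make the approach concrete, here is a deliberately simplified sketch of what "mapping code patterns to regulation clauses" means (this is a toy single-pattern rule, not our production engine; the HIPAA citation is real, the rule structure is illustrative):

```python
import re

# A rule pairs a code pattern with the regulation clause it undermines.
# 45 CFR 164.312(e)(1) is HIPAA's Transmission Security standard.
RULES = [
    {
        "id": "tls-verification-disabled",
        "pattern": re.compile(r"verify\s*=\s*False"),
        "mapping": "HIPAA 45 CFR 164.312(e)(1) - Transmission Security",
    },
]

def scan(source: str) -> list[str]:
    """Return one finding per line that matches a rule."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for rule in RULES:
            if rule["pattern"].search(line):
                findings.append(f"line {lineno}: {rule['id']} -> {rule['mapping']}")
    return findings

findings = scan("resp = requests.get(url, verify=False)  # nosec\n")
```

A real engine needs data-flow analysis rather than line matching, but the output shape is the same: a code location tied to a specific compliance obligation.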

13,427 confirmed violations. 43.6% of repositories affected.

The organizations involved are not small or obscure. They include
the US Centers for Disease Control and Prevention, the US Department of Veterans
Affairs, NHS England, Google, and some of the most widely deployed
open-source healthcare platforms in the world.

The gap nobody is closing

Here is the compliance problem nobody talks about.

Security scanners like Snyk and Semgrep find known vulnerabilities,
outdated dependencies, common attack patterns, CVEs. Compliance
audits check whether policies exist, whether a Business Associate
Agreement is signed, whether an access control policy is documented.

Neither examines whether the application code actually implements
the safeguards that regulations require.

A hospital can have a perfect HIPAA policy document and a clean
Snyk scan while its billing export writes every patient's Social
Security Number to a plaintext CSV file on the server filesystem.
That is not a hypothetical. That is OpenEMR, the most widely
deployed open-source EMR with over 100,000 installations worldwide.

This is the gap. And it is systemic.

What we actually found

The VA knew and suppressed it

The US Department of Veterans Affairs notification-api, deployed to
AWS GovCloud, handles SMS, email, and push notifications for 9 million
veterans.

One Lambda function disables TLS certificate verification with
verify=False. Alongside it sits an explicit # nosec annotation,
a security scanner suppression comment used by developers to silence
warnings they don't want to fix.

The development team was aware this was a security issue. They
suppressed the warning and deployed to production anyway. Veteran
phone numbers and SMS content are logged in plaintext. All affected
functions deploy across dev, staging, perf, and prod environments
via GitHub Actions.
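To show what that pattern actually disables, here is a minimal stdlib sketch (illustrative only, not the VA's Lambda code). `# nosec` is the marker Bandit-style scanners honor to skip a flagged line:

```python
import ssl

# The anti-pattern: opt out of certificate validation entirely.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False        # hostname no longer validated
insecure_ctx.verify_mode = ssl.CERT_NONE   # nosec -- scanner warning silenced

# The secure default the suppression opts out of:
secure_ctx = ssl.create_default_context()
```

With `CERT_NONE`, any server holding any certificate, including an attacker in the middle, passes the TLS handshake.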

NHS England queries 58 million patient records with TLS disabled

OpenSAFELY is a secure analytics platform for NHS England
electronic health records. The cohort-extractor tool queries
patient data from approximately 58 million NHS patients.

TLS certificate verification is unconditionally disabled for its
EMIS database connection. Security warnings are globally suppressed.
A TODO comment in the code confirms the developers are aware:

# TODO remove this when certificate verification reinstated

The TODO is still there.
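What "globally suppressed" means in practice can be sketched with Python's stdlib `warnings` module (an illustrative stand-in; OpenSAFELY's actual suppression mechanism may differ): the warning still fires, but an ignore-all filter means it never reaches a log or terminal.

```python
import warnings

def connect_insecurely():
    """Hypothetical stand-in for a DB connection made with TLS checks off.

    # TODO remove this when certificate verification reinstated
    """
    warnings.warn("TLS certificate verification is disabled", UserWarning)
    return "connected-without-verification"

# The anti-pattern: silence all warnings process-wide, so the notice
# above -- and every future security warning -- is silently dropped.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    result = connect_insecurely()

# `caught` stays empty: the warning was emitted but never surfaced.
```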

India's vaccination platform logged Aadhaar numbers to stdout

India's DIVOC platform powered the national COVID vaccination
certificate system used by hundreds of millions of Indian citizens.

The code serializes the entire certificate request to application
logs for every single certificate created: Aadhaar number, name,
date of birth, gender, phone number, home address. A separate
analytics consumer prints every Kafka vaccination message to stdout
unconditionally, with no feature flag, no log-level gate, and no
way to disable it without modifying source code.
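The shape of that consumer, as a hedged sketch (field names are invented for illustration; this is not DIVOC's actual code or schema):

```python
import json

def handle_message(raw_message: bytes, log_lines: list[str]) -> None:
    """Consume one vaccination message and log it whole.

    The anti-pattern: the full record, identifiers included, is
    serialized straight into the log stream, with no gate of any kind.
    """
    record = json.loads(raw_message)
    log_lines.append(json.dumps(record))  # every field, every message

log: list[str] = []
handle_message(
    b'{"aadhaar": "XXXX-XXXX-1234", "name": "Test Patient", "dob": "1970-01-01"}',
    log,
)
```

Log aggregators, stdout capture, and backup systems then replicate each record far beyond the application's own trust boundary.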

Production CoWIN URLs in the Kubernetes deployment configs confirm
this ran on India's national vaccination infrastructure.

We reported this to CERT-In on March 23, 2026.

OpenEMR writes patient SSNs to plaintext CSV files

OpenEMR's billing export feature writes patient Social Security
Numbers, names, dates of birth, addresses, and phone numbers to a
plaintext CSV file via fwrite(). Zero encryption. No audit trail.
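OpenEMR itself is PHP; for consistency with the other sketches, here is the shape of the pattern in Python, with invented field names (an illustration of the behavior, not OpenEMR's code):

```python
import csv
import io

def export_billing(patients: list[dict], out: io.StringIO) -> None:
    """Write patient rows to CSV with no encryption layer in between."""
    writer = csv.DictWriter(out, fieldnames=["name", "ssn", "dob"])
    writer.writeheader()
    for patient in patients:
        writer.writerow(patient)  # SSN lands in the output in cleartext

buf = io.StringIO()
export_billing(
    [{"name": "Test Patient", "ssn": "000-00-0000", "dob": "1970-01-01"}],
    buf,
)
```

Anything that can read the server filesystem, a backup job, a misconfigured web root, a compromised account, reads the identifiers verbatim.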

When we contacted the OpenEMR security team, they confirmed this
is intended functionality, citing HIPAA's "addressable"
specification as placing the obligation on the deploying
organization.

100,000+ installations. Every one generating unencrypted files
containing the most sensitive category of patient data with no
application-level option to encrypt them.

The finding that should concern everyone most

Of everything we found, one pattern stands out as the most
forward-looking risk.

Across 3,000 repositories, our analysis detected 657 confirmed
instances of patient medical data flowing into AI and machine
learning pipelines without de-identification. This includes CSV
exports fed into model training, inference calls containing
identifiable patient records, and analytics pipelines processing
raw PHI.

Metriport, a funded healthcare API company, sends patient medical
record data to an AI model via Amazon Bedrock. No de-identification
or tokenization is visible before the API call.
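The missing step is straightforward to sketch. Below is a minimal illustrative redactor that strips direct identifiers before a record would be handed to a model API; this is not Metriport's code, and a real deployment would need a full de-identification pass (e.g. all eighteen HIPAA Safe Harbor identifier categories), not one regex:

```python
import re

# Direct identifiers to drop outright; SSN-shaped strings to mask in text.
DROP_FIELDS = {"name", "ssn", "address"}
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def deidentify(record: dict) -> dict:
    """Remove identifier fields and mask SSN-shaped substrings."""
    cleaned = {k: v for k, v in record.items() if k not in DROP_FIELDS}
    return {
        k: SSN_RE.sub("[REDACTED]", v) if isinstance(v, str) else v
        for k, v in cleaned.items()
    }

safe = deidentify({
    "name": "Test Patient",
    "ssn": "000-00-0000",
    "notes": "Pt SSN 000-00-0000 on file; presents with cough.",
})
# `safe` keeps the clinical note but carries no name, no SSN field,
# and no SSN-shaped string inside free text.
```

The pattern our scan flags is precisely the absence of any step like this between the PHI source and the model call.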

This is not a niche finding. As healthcare organizations race to
adopt AI for clinical decision support, triage automation, and
population health analytics, PHI is flowing into these pipelines
at scale. Traditional security scanners were not built to detect
this pattern. Most compliance frameworks have not yet caught up
with it either.

The code is ahead of the regulations. And the data is already
moving.

Why this keeps happening

Compliance failure in our dataset does not correlate with funding,
team size, or institutional credibility. The VA has significant
engineering resources. NHS England runs one of the largest health
data platforms in the world. Google has some of the best security
engineers on the planet.

The correlation is simpler: nobody checked the code.

Not because organizations are negligent. Because the tools and
processes that exist today were not designed to check it. Security
scanners operate at the dependency and vulnerability layer.
Compliance audits operate at the policy and process layer. The
application code layer, where PHI actually moves, where encryption
actually gets implemented or skipped, where logging decisions
actually get made, sits between these two worlds largely unexamined.

That is the gap. And until it gets closed, compliance certifications
will continue to mean less than they should.

What this means for patients

You have no way of knowing whether the healthcare application
handling your data was ever checked at the code level for
compliance. Your provider may have passed a HIPAA audit. Their
software vendor may have a SOC 2 certificate. Neither guarantees
that your Social Security Number is not being written to a plaintext
file somewhere on a server filesystem.

That is not a reason for panic. It is a reason for the industry
to close a gap that has been ignored for too long.

All affected organizations were notified through responsible
disclosure channels prior to publication.


Full technical writeup with methodology: securehealth-ai.com/blog/we-scanne...