Nvidia and Eli Lilly announced a $1 billion AI drug discovery lab today at the J.P. Morgan Healthcare Conference. The press releases are full of the expected language: "reinvent drug discovery," "accelerate medicine development," "foundation models for biology." Lilly's CEO David Ricks said they're "combining our volumes of data and scientific knowledge with Nvidia's computational power."
But, MAN, there is a phrase in there that is doing an INSANE amount of hard work: "combining our volumes of data."
How, exactly?
The Missing Paragraph
Conspicuously absent from the coverage: any discussion of how pharma data actually moves. The lab will be in South San Francisco. Lilly's clinical trial data, compound libraries, and patient information live in facilities scattered across Indiana, Ireland, and dozens of research sites worldwide. The announcement talks about "lab-in-the-loop" systems linking wet labs and dry labs in "24/7 AI-assisted experimentation."
That's a lovely vision. It also assumes data flows like water between these locations. In pharma, it doesn't.
Why Pharma Data Is Different
Clinical trial data contains protected health information under HIPAA. Proprietary compound structures represent billions in R&D investment and competitive advantage. Manufacturing process data falls under FDA's 21 CFR Part 11, which mandates complete audit trails for every electronic record: who touched it, when, why, and what changed.
These aren't bureaucratic inconveniences that clever engineering can route around. They're structural constraints that exist because the consequences of failure are measured in patient safety and billion-dollar regulatory actions.
I've been talking to teams that operate in these environments. The pattern is consistent: they don't lack compute. They lack the ability to make their data usable without making it movable.
The Air Gap Paradox
Traditional security thinking offers two options. Lock data down completely in air-gapped environments where nothing gets out. Or open it up for analysis and accept the exfiltration risk. Pharma has mostly chosen the first option, which is why so much valuable data sits in protected directories that researchers can barely access.
The promise of AI drug discovery assumes you can train models on this data. But training requires moving data to compute, or moving compute to data. The first option triggers every compliance alarm in the building. The second option is what the press releases hand-wave past.
Security teams need something in between: protected environments where data scientists can actually work, but where every attempted data movement gets logged, analyzed, and blocked if it violates policy. Not just access controls (who can log in) but egress controls (what can leave). The ability to process data, transform it, analyze it, without ever letting raw records escape the protected perimeter.
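To make the access-versus-egress distinction concrete, here is a minimal sketch of a whitelist-only egress check that logs every attempt, allowed or not. All the names here (`EgressRequest`, `ALLOWED_DESTINATIONS`, the `payload_kind` labels) are illustrative assumptions, not any real product's API:

```python
# Sketch: whitelist-only egress control with a mandatory audit log.
# Every attempted data movement is logged; only approved destinations
# carrying non-raw payloads are allowed out of the perimeter.
from dataclasses import dataclass
from datetime import datetime, timezone

ALLOWED_DESTINATIONS = {"model-registry.internal", "metrics.internal"}
AUDIT_LOG: list[dict] = []

@dataclass
class EgressRequest:
    user: str
    destination: str
    payload_kind: str  # e.g. "aggregate" or "raw_record"

def evaluate_egress(req: EgressRequest) -> bool:
    """Allow only whitelisted destinations and non-raw payloads; log every attempt."""
    allowed = (req.destination in ALLOWED_DESTINATIONS
               and req.payload_kind != "raw_record")
    AUDIT_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": req.user,
        "destination": req.destination,
        "payload_kind": req.payload_kind,
        "decision": "allow" if allowed else "block",
    })
    return allowed
```

The point of the sketch is the asymmetry: the decision can be a one-liner, but the log is unconditional. Blocked attempts are often the most valuable records in the trail.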
This is remarkably hard to build. It's also not a GPU problem.
The Audit Trail Problem
21 CFR Part 11 requires that regulated companies maintain computer-generated, time-stamped audit trails recording every modification to electronic records.
Let's say that again: every modification.
The trail must include the operator identity, date/time, and the nature of the change.
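For a sense of what that requirement implies in code, here is an illustrative shape for one audit trail entry: operator, timestamp, and the nature of the change, with each entry hash-chained to the previous one so tampering is detectable. This is a sketch of the concept, not a compliant implementation or legal guidance:

```python
# Sketch of a Part 11-style audit entry: who, when, what changed,
# chained by hash so the trail itself can be verified end to end.
import hashlib
import json
from datetime import datetime, timezone

def audit_entry(operator: str, record_id: str, action: str,
                old_value: str, new_value: str, prev_hash: str) -> dict:
    entry = {
        "operator": operator,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "record_id": record_id,
        "action": action,        # e.g. "update", "delete"
        "old_value": old_value,
        "new_value": new_value,
        "prev_hash": prev_hash,  # links this entry to the one before it
    }
    # Hash over the canonical serialization of everything above.
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return entry
```

A real system would add electronic signatures, retention policies, and tamper-evident storage; the sketch only shows why "every modification" is an append, never an overwrite.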
Now imagine training a foundation model on clinical trial data. The model sees millions of records. It learns patterns. It generates new molecular structures based on those patterns. What's the audit trail for that? When the model suggests a compound, which training records influenced that suggestion? When a researcher uses an AI-generated insight to make a decision, how do you document the provenance?
These aren't hypothetical concerns. The FDA released draft guidance on AI in drug development in January 2025, outlining a risk-based credibility assessment framework for AI models across nonclinical, clinical, and manufacturing phases. Regulators are actively figuring out how to apply existing frameworks to machine learning systems. Companies that can demonstrate clean data lineage from source through model to output will have a structural advantage in regulatory discussions.
What Nvidia's Billion Dollars Actually Buys
Nvidia and Lilly aren't naive about these challenges. The announcement mentions that researchers will "generate large-scale data" in the lab itself, creating new datasets specifically designed for AI training. That sidesteps some of the legacy data problems.
The collaboration will (likely) use Nvidia's BioNeMo platform, an open framework for building and training deep learning models for drug discovery that's been adopted by over 200 techbios and large pharma companies. They're also focusing initial efforts on applications where data constraints are less severe: manufacturing optimization, process simulation, early-stage compound screening. These are real opportunities where GPU compute genuinely is the bottleneck.
But the highest-value problems in drug discovery involve the data that's hardest to access: longitudinal patient records from clinical trials, real-world evidence from treatment outcomes, proprietary biological assay results accumulated over decades of R&D. That data can't just be copied to a shiny new lab in South San Francisco. And the estimated $1-3 billion cost to develop a single new drug happens largely because of failures that better data access might prevent.
The Actual Hard Problem
The companies that figure out "compute over data" for regulated industries will eat this market. Not by building bigger GPU clusters, but by solving the governance layer that lets valuable data become usable without becoming vulnerable.
What does that look like in practice? Tagging data at the source with cryptographic fingerprints so you can always verify provenance. Processing pipelines that run inside protected perimeters with whitelist-only egress. Audit systems that log not just access but every transformation, every query, every attempted export. The ability to prove, at any point, exactly what happened to every record.
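The first item on that list, tagging at the source, can be sketched in a few lines: fingerprint each record at ingest, then re-verify the fingerprint at any later stage of the pipeline. The canonical-JSON-plus-SHA-256 approach here is one plausible choice, not a prescribed standard:

```python
# Sketch of source-tagging for provenance: a deterministic content hash
# computed at ingest lets any downstream stage prove a record is unchanged.
import hashlib
import json

def fingerprint(record: dict) -> str:
    """Deterministic SHA-256 of the record's canonical JSON form."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify(record: dict, expected: str) -> bool:
    """True only if the record is byte-for-byte what was tagged at the source."""
    return fingerprint(record) == expected
```

Everything downstream, transformations, queries, exports, can then log the fingerprints it touched, which is what makes "prove exactly what happened to every record" tractable.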
This is boring infrastructure work. It doesn't make for exciting keynote demos. But it's the actual constraint on AI-driven drug discovery, and throwing more GPUs at it doesn't help.
What I'd Watch For
If you're evaluating pharma AI investments, look past the compute announcements. Ask instead:
- How does the company handle data that can't leave its current location?
- What's their approach to federated learning or on-premises model training?
- How do they maintain audit trails through AI-assisted workflows?
- What's their story for regulatory submissions that involve AI-generated insights?
The GPU buildout is the visible part of the iceberg. The governance layer underneath is where the actual differentiation happens.
Nvidia's bet will work for some use cases. Public datasets, synthetic data, newly generated experimental results. But the highest-value pharma AI problems live behind walls that compute power alone can't scale. The billion dollars is impressive. The missing paragraph about data governance is the real story.
Want to learn how intelligent data pipelines can reduce your AI costs? Check out Expanso. Or don't. Who am I to tell you what to do.
NOTE: I'm currently writing a book based on what I have seen about the real-world challenges of data preparation for machine learning, focusing on operations, compliance, and cost. I'd love to hear your thoughts!
Originally published at Distributed Thoughts.