SandboxAQ, an AI and quantum tech company backed by Nvidia, has released a dataset of more than 5.2 million three-dimensional molecular structures. The goal is to speed up AI-driven drug discovery by giving researchers and developers a detailed resource for training predictive models.
What’s in the Dataset
- Over 5.2 million molecular conformers in SDF format
- Molecular identifiers (SMILES, InChIKey)
- Simulated binding affinities
- Descriptors like polarity, charge, and scaffold classification
The molecules span a chemically diverse space and cover targets relevant to oncology, immunology, and neurological diseases. The data were generated using quantum-informed simulations, followed by ML-guided filtering for drug-likeness.
Example Usage in Python
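from sandboxaq_loader import MoleculeDataset

# Load the conformers from the SDF file
dataset = MoleculeDataset("sandboxaq_5m.sdf")

# Print each molecule's SMILES string and simulated binding affinity
for mol in dataset:
    print(mol.smiles, mol.binding_affinity)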
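If the loader above is not available in your environment, the per-molecule data fields of an SDF file can also be read with the standard library alone. The following is a minimal sketch, not SandboxAQ's own tooling: it relies only on the general SDF layout (a molblock ending in "M  END", data items introduced by "> <NAME>" and closed by a blank line, and records terminated by "$$$$"). The function name iter_sdf_fields, the sample records, and the field names SMILES and binding_affinity are assumptions made for illustration; the real file may label its fields differently.

```python
import re
from typing import Dict, Iterable, Iterator

# Matches an SDF data header such as "> <SMILES>" and captures the field name.
_HEADER = re.compile(r"^>.*?<([^>]+)>")

# Two hypothetical records with empty placeholder molblocks (zero atoms),
# kept short so the data-field layout is easy to see.
SAMPLE_SDF = """\
mol_1
  hypothetical-sample

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
CCO

> <binding_affinity>
-6.2

$$$$
mol_2
  hypothetical-sample

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
c1ccccc1

> <binding_affinity>
-7.5

$$$$
"""


def iter_sdf_fields(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict of data fields per SDF record, ignoring the molblock."""
    fields: Dict[str, str] = {}
    current = None    # name of the data field currently being read
    in_data = False   # True once the molblock ("M  END") has been passed
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":        # end of record
            yield fields
            fields, current, in_data = {}, None, False
        elif line.startswith("M  END"):   # end of molblock, data items follow
            in_data = True
        elif in_data:
            match = _HEADER.match(line)
            if current is None and match:
                current = match.group(1)
                fields[current] = ""
            elif current is not None:
                if not line.strip():      # a blank line closes the field
                    current = None
                else:                     # multi-line values are newline-joined
                    sep = "\n" if fields[current] else ""
                    fields[current] += sep + line


records = list(iter_sdf_fields(SAMPLE_SDF.splitlines()))
for rec in records:
    print(rec["SMILES"], float(rec["binding_affinity"]))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so a multi-million-record file is streamed one record at a time rather than loaded into memory.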
DataFrame Conversion Example
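import pandas as pd

# `dataset` is the MoleculeDataset created in the previous example
df = dataset.to_dataframe()

# Show the first few molecules whose target class is oncology
print(df[df["target_class"] == "oncology"].head())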
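Once the records are in a DataFrame, the usual pandas filtering and grouping operations apply. The sketch below builds a small frame by hand so it runs without the dataset; the three rows and their values are made up, and the column names (smiles, binding_affinity, target_class) simply mirror those used in the examples above and may not match the real schema.

```python
import pandas as pd

# Hypothetical rows standing in for the output of dataset.to_dataframe().
records = [
    {"smiles": "CCO", "binding_affinity": -6.2, "target_class": "oncology"},
    {"smiles": "c1ccccc1", "binding_affinity": -7.5, "target_class": "immunology"},
    {"smiles": "CC(=O)O", "binding_affinity": -5.1, "target_class": "oncology"},
]
df = pd.DataFrame.from_records(records)

# Keep only the oncology rows, then sort by affinity in ascending order.
oncology = df[df["target_class"] == "oncology"]
ranked = oncology.sort_values("binding_affinity")

# Count and average the affinity values within each target class.
summary = df.groupby("target_class")["binding_affinity"].agg(["count", "mean"])

print(ranked)
print(summary)
```

Whether a lower or a higher affinity value indicates stronger binding depends on the units and sign convention the dataset uses, so check its documentation before ranking candidates this way.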
Why It Matters
Drug discovery is time-consuming and expensive. This dataset lets AI models learn from a large chemical space before any lab validation, which can reduce both the cost and the time of early-stage screening. By combining physics, quantum chemistry, and machine learning, SandboxAQ is helping reshape how new medicines are discovered.
What’s Next
The dataset is free for academic use. SandboxAQ plans to launch benchmarking challenges and expand access to industry researchers.
Sources
https://www.reuters.com/business/healthcare-pharmaceuticals/nvidia-backed-ai-startup-sandboxaq-creates-new-data-speed-up-drug-discovery-2025-06-18/
https://www.sandboxaq.com/press/2025-drug-discovery-dataset-release