DEV Community

Maurizio Morri

# SandboxAQ Releases 5 Million Molecule Dataset for Drug Discovery

SandboxAQ, an AI and quantum technology company backed by Nvidia, has released a dataset of more than 5.2 million three-dimensional molecular structures. The goal is to accelerate AI-driven drug discovery by giving researchers and developers a detailed resource for building predictive models.

What’s in the Dataset

  • Over 5.2 million molecular conformers in SDF format
  • Molecular identifiers (SMILES, InChIKey)
  • Simulated binding affinities
  • Descriptors like polarity, charge, and scaffold classification

The molecules span a chemically diverse space with targets relevant to oncology, immunology, and neurological diseases. The data were generated using quantum-informed simulations, followed by ML-guided filtering for drug-likeness.
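Because the conformers ship as SDF, the per-molecule fields listed above can in principle be read with nothing but the standard library. The sketch below is a minimal illustration under stated assumptions: the field names `SMILES` and `binding_affinity`, the function `iter_sdf_records`, and the sample record are all hypothetical placeholders, not taken from the actual release. It relies only on the general SDF convention that data items start with a `> <NAME>` header, end at a blank line, and that each record ends with `$$$$`.

```python
from typing import Dict, Iterable, Iterator, List, Optional


def iter_sdf_records(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield the data fields of each SDF record as a dict.

    Only the "> <FIELD>" data items are parsed; the molblock holding
    the 3D coordinates is skipped in this sketch.
    """
    fields: Dict[str, str] = {}
    current: Optional[str] = None  # name of the data field being read
    values: List[str] = []         # value lines collected for that field
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            # Data header such as "> <SMILES>": extract the field name.
            current = line[line.index("<") + 1 : line.rindex(">")]
            values = []
        elif line.strip() == "$$$$":
            # Record terminator: flush any open field, emit, reset.
            if current is not None:
                fields[current] = "\n".join(values)
            yield fields
            fields, current, values = {}, None, []
        elif current is not None:
            if line.strip() == "":
                # A blank line ends the current data item.
                fields[current] = "\n".join(values)
                current = None
            else:
                values.append(line)


# Hypothetical one-record sample; the values are placeholders, not real data.
sample = """\
mol_1
  placeholder

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
CCO

> <binding_affinity>
-6.2

$$$$
"""

records = list(iter_sdf_records(sample.splitlines()))
print(records)
```

For real work, a cheminformatics toolkit that also parses the coordinates would be the more robust choice. This parser is only meant to show how the identifiers and descriptors sit alongside each structure in the file.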

Example Usage in Python

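from sandboxaq_loader import MoleculeDataset

# Load the conformers from the SDF file
dataset = MoleculeDataset("sandboxaq_5m.sdf")

# Print the SMILES string and simulated binding affinity of each molecule
for mol in dataset:
    print(mol.smiles, mol.binding_affinity)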

DataFrame Conversion Example


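import pandas as pd  # to_dataframe() is expected to return a pandas DataFrame

# Reuses `dataset` from the previous example
df = dataset.to_dataframe()

# Show the first rows whose target class is oncology
print(df[df["target_class"] == "oncology"].head())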
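
A natural follow-up to filtering by target class is shortlisting the strongest predicted binders in each class. The sketch below shows one way to do that with the standard library only, so it does not depend on the loader. The rows are made-up placeholders, the column names mirror those used in the snippets above, `top_binders` is a hypothetical helper, and it assumes that a lower (more negative) affinity value means stronger predicted binding, which the dataset's own documentation would need to confirm.

```python
import heapq
from collections import defaultdict
from typing import Dict, List

# Made-up placeholder rows; not real entries from the dataset.
rows = [
    {"smiles": "CCO", "target_class": "oncology", "binding_affinity": -5.1},
    {"smiles": "CCN", "target_class": "oncology", "binding_affinity": -7.4},
    {"smiles": "CCC", "target_class": "immunology", "binding_affinity": -6.0},
    {"smiles": "CCCl", "target_class": "oncology", "binding_affinity": -6.3},
]


def top_binders(rows: List[dict], k: int) -> Dict[str, List[dict]]:
    """Return the k rows with the lowest affinity value per target class."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row["target_class"]].append(row)
    # Assumption: lower (more negative) value = stronger predicted binding.
    return {
        cls: heapq.nsmallest(k, members, key=lambda r: r["binding_affinity"])
        for cls, members in by_class.items()
    }


shortlist = top_binders(rows, k=2)
for cls, members in shortlist.items():
    print(cls, [m["smiles"] for m in members])
```

With a DataFrame, the same shortlist could be built by sorting on the affinity column and taking the first rows of each group. The dictionary version is shown here because it runs without any third-party packages.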

Why It Matters

Drug discovery is time-consuming and expensive. This dataset lets AI models learn from a large chemical space before lab validation, which could reduce both cost and development time. By combining physics, quantum chemistry, and machine learning, SandboxAQ aims to change how new medicines are discovered.

What’s Next

The dataset is free for academic use. SandboxAQ plans to launch benchmarking challenges and expand access to industry researchers.

Sources

https://www.reuters.com/business/healthcare-pharmaceuticals/nvidia-backed-ai-startup-sandboxaq-creates-new-data-speed-up-drug-discovery-2025-06-18/

https://www.sandboxaq.com/press/2025-drug-discovery-dataset-release
