# SandboxAQ Releases 5 Million Molecule Dataset for Drug Discovery


SandboxAQ, an AI and quantum technology company backed by Nvidia, has released a dataset of more than 5.2 million three-dimensional molecular structures. The goal is to accelerate AI-driven drug discovery by giving researchers and developers a detailed resource for building predictive models.

## What’s in the Dataset

- Over 5.2 million molecular conformers in SDF format
- Molecular identifiers (SMILES, InChIKey)
- Simulated binding affinities
- Descriptors such as polarity, charge, and scaffold classification
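
To make the SDF layout concrete, here is a minimal sketch of reading per-molecule data fields using only the Python standard library. It relies on the general SDF convention that records end with `$$$$` and that data fields appear as `> <NAME>` followed by the value and a blank line. The field names (`SMILES`, `INCHIKEY`, `BINDING_AFFINITY`), the sample records, and the affinity values are made up for illustration and are not the dataset's documented schema.

```python
from typing import Dict, Iterator, Optional

# Hypothetical two-record SDF snippet. Atom blocks are omitted for brevity,
# and the field names and affinity values are illustrative assumptions.
SAMPLE_SDF = """\
mol_1
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
CCO

> <INCHIKEY>
LFQSCWFLJHTTHZ-UHFFFAOYSA-N

> <BINDING_AFFINITY>
-5.2

$$$$
mol_2
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <SMILES>
c1ccccc1

> <BINDING_AFFINITY>
-4.1

$$$$
"""


def iter_sdf_fields(text: str) -> Iterator[Dict[str, str]]:
    """Yield one {field_name: value} dict per '$$$$'-terminated SDF record."""
    fields: Dict[str, str] = {}
    name: Optional[str] = None  # data field currently being read, if any
    for line in text.splitlines():
        if line.strip() == "$$$$":
            # Record delimiter: emit what we collected and start over.
            yield fields
            fields, name = {}, None
        elif line.startswith(">"):
            # Data header such as "> <SMILES>": take the text between < and >.
            start, end = line.find("<"), line.rfind(">")
            name = line[start + 1:end] if 0 <= start < end else None
            if name:
                fields[name] = ""
        elif name is not None:
            if line.strip() == "":
                name = None  # a blank line closes the current field
            else:
                # Multi-line values are joined with newlines.
                fields[name] = f"{fields[name]}\n{line}" if fields[name] else line


records = list(iter_sdf_fields(SAMPLE_SDF))
for rec in records:
    print(rec.get("SMILES"), rec.get("BINDING_AFFINITY"))
```

This sketch ignores the atom and bond blocks entirely, so it only suits quick metadata scans. Work that needs the 3D coordinates would call for a full cheminformatics toolkit.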

The molecules span a chemically diverse space with targets relevant to oncology, immunology, and neurological diseases. The data were generated using quantum-informed simulations, followed by ML-guided filtering for drug-likeness.
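
The article does not describe how the ML-guided drug-likeness filter works. As a rough stand-in, the sketch below applies Lipinski's rule of five, a classic rule-based heuristic rather than a machine-learning model, and it is not SandboxAQ's method. The `Descriptors` class, its field names, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Hypothetical per-molecule descriptors; not the dataset's schema."""
    mol_weight: float       # molecular weight in daltons
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of Lipinski's four thresholds a molecule exceeds."""
    return sum([
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ])


def is_drug_like(d: Descriptors, max_violations: int = 1) -> bool:
    """Treat a molecule as drug-like if it breaks at most `max_violations` rules."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, used only to exercise the filter.
small = Descriptors(mol_weight=320.0, logp=2.5, h_bond_donors=2, h_bond_acceptors=5)
large = Descriptors(mol_weight=780.0, logp=6.3, h_bond_donors=6, h_bond_acceptors=12)

print(is_drug_like(small), is_drug_like(large))
```

A learned filter would replace the fixed thresholds with a model trained on known drugs, but the interface (descriptors in, keep-or-drop decision out) would look much the same.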

## Example Usage in Python

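```python
from sandboxaq_loader import MoleculeDataset

# Open the SDF file with the loader
dataset = MoleculeDataset("sandboxaq_5m.sdf")

# Print each molecule's SMILES string and simulated binding affinity
for mol in dataset:
    print(mol.smiles, mol.binding_affinity)
```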

## DataFrame Conversion Example

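```python
import pandas as pd

# `dataset` is the MoleculeDataset created in the previous example
df = dataset.to_dataframe()

# Show the first few molecules aimed at oncology targets
print(df[df["target_class"] == "oncology"].head())
```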
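
Beyond filtering on one target class, a common next step is to rank candidates within each class. The sketch below does this with the standard library on plain dictionaries, so it runs without the loader. The keys `smiles`, `target_class`, and `binding_affinity` mirror the names used above, the rows are made up, and the sketch assumes that a more negative affinity means stronger binding (as with a binding free energy), which the article does not state.

```python
import heapq
from collections import defaultdict
from typing import Dict, List

# Made-up rows standing in for records exported from the dataset.
rows = [
    {"smiles": "CCO", "target_class": "oncology", "binding_affinity": -5.2},
    {"smiles": "CCN", "target_class": "oncology", "binding_affinity": -7.9},
    {"smiles": "CCC", "target_class": "oncology", "binding_affinity": -6.4},
    {"smiles": "CCCl", "target_class": "immunology", "binding_affinity": -4.0},
]


def top_binders(records: List[dict], k: int = 2) -> Dict[str, List[dict]]:
    """Return the k strongest binders per target class.

    Assumes lower (more negative) affinity values indicate stronger binding.
    """
    by_class: Dict[str, List[dict]] = defaultdict(list)
    for rec in records:
        by_class[rec["target_class"]].append(rec)
    return {
        cls: heapq.nsmallest(k, group, key=lambda r: r["binding_affinity"])
        for cls, group in by_class.items()
    }


best = top_binders(rows, k=2)
for cls, group in best.items():
    print(cls, [r["smiles"] for r in group])
```

If the dataset reports affinity on a scale where higher is better, swapping `heapq.nsmallest` for `heapq.nlargest` would be the only change needed.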

## Why It Matters

Drug discovery is time-consuming and expensive. This dataset lets AI models learn from a large chemical space before lab validation, which could cut both the cost and the time of early-stage screening. By combining physics, quantum chemistry, and machine learning, SandboxAQ aims to change how new medicines are discovered.