When you pull chemical structure or bioactivity data via API, every database has its own endpoint design, license terms, and coverage. This guide surveys research DBs, national regulatory inventories, and Snowflake as of May 2026.
Research DB Overview
| DB | Contents | API | Bulk Download | License |
|---|---|---|---|---|
| PubChem | Compounds, bioactivity, toxicity, patents | REST | ✓ (FTP / PUG Download) | Public domain |
| ChEMBL | Drug bioactivity | REST + Python SDK | ✓ (TSV / SDF / SQL, FTP) | CC BY-SA 3.0 |
| RCSB PDB | Protein 3D structures | GraphQL + REST | ✓ (FTP / rsync) | CC0 |
| DrugBank | Drug info, DDI | REST (registration required) | Suspended | CC BY-NC 4.0 |
| ZINC | Purchasable compounds, 3D | None | ✓ (SDF / SMILES, FTP) | Free |
| BindingDB | Binding affinity (Ki, IC50, etc.) | Limited REST | ✓ (TSV, Excel-readable) | CC BY 4.0 |
| UniChem | Cross-DB ID translation | REST | — | Free |
PubChem: The First Stop
PubChem (NCBI) holds 119 million compounds and 295 million bioactivity records (NAR 2025), aggregating from over 1,000 sources including ChEMBL and DrugBank. Public domain.
# Compound name → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/JSON"
# CID → SMILES (2244 is aspirin, matching the other examples)
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/IsomericSMILES/JSON"
# InChIKey → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/cids/JSON"
Rate limit: 5 requests/sec, 400 requests/min. For tens of thousands of records or more, use PUG Download (up to 250,000 records per request) or FTP.
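One way to stay under the per-second limit is a small client-side throttle. The sketch below is an illustration only: the `Throttle` class and its parameters are hypothetical names, not part of any PubChem client, and the clock and sleep functions are injectable so the logic can be checked without waiting.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space out calls so that at most `per_second` happen each second."""

    def __init__(
        self,
        per_second: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was released

    def wait(self) -> float:
        """Block until the next call is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` before each request spaces requests at least 0.2 seconds apart at the default setting, which is 300 requests per minute at most and therefore also within the 400 requests/min limit.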
PubChem aggregates from many sources, so conflicting values for the same compound are common. For IC50 data specifically, ChEMBL's curated records are more reliable.
ChEMBL: The Standard for ML Data
ChEMBL (EMBL-EBI) is a manually curated bioactivity database extracted from medicinal chemistry literature. It includes unit normalization and activity classification for IC50, Ki, and EC50 values, and has become the standard training data source for ML models.
As of ChEMBL 36 (September 2025): 2.878 million compounds, 24.27 million activity records.
from chembl_webresource_client.new_client import new_client
# Look up one molecule by ChEMBL ID (CHEMBL25 is aspirin)
molecule = new_client.molecule
m = molecule.get('CHEMBL25')
print(m['molecule_structures']['canonical_smiles'])
# Print the first five IC50 records for one target
activity = new_client.activity
for a in activity.filter(target_chembl_id='CHEMBL205', standard_type='IC50')[:5]:
    print(a['molecule_chembl_id'], a['standard_value'], a['standard_units'])
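For ML use, the `standard_value` and `standard_units` pair printed above is usually converted to a log scale such as pIC50 (the negative base-10 logarithm of the molar concentration). The helper below is a minimal sketch: `to_pic50` and the unit table are hypothetical names, and the set of unit strings is an assumption to check against the records you actually retrieve.

```python
import math
from typing import Optional, Union

# Multipliers that convert a concentration into mol/L (assumed unit spellings)
_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_pic50(
    standard_value: Union[str, float, None],
    standard_units: Optional[str],
) -> Optional[float]:
    """Return -log10(IC50 in mol/L), or None when the record is unusable."""
    if standard_value is None or standard_units not in _TO_MOLAR:
        return None
    value = float(standard_value)  # the API may return numbers as strings
    if value <= 0:
        return None
    return -math.log10(value * _TO_MOLAR[standard_units])
```

Records with missing values or units outside the table return `None`, so they can be dropped explicitly instead of being silently treated as nanomolar.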
Full DB downloads are available via FTP in SQLite, MySQL, and PostgreSQL formats. The chembl-downloader package (pip install chembl-downloader) handles reproducible retrieval.
CC BY-SA 3.0. Derivatives must be released under the same license.
RCSB PDB: Protein 3D Structures
RCSB PDB holds 254,000 experimentally determined structures (204,000 X-ray, 34,000 cryo-EM, 14,000 NMR). Fully open (CC0), with complete bulk download via FTP.
The GraphQL API lets you fetch structure info, ligands, and citations in a single request.
import requests
query = """{ entry(entry_id: "1ATP") {
struct { title }
rcsb_entry_info { resolution_combined }
} }"""
resp = requests.post("https://data.rcsb.org/graphql", json={"query": query})
print(resp.json()["data"]["entry"]["struct"]["title"])
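The same request shape extends to several entries at once through the `entries(entry_ids: [...])` root field. The sketch below only builds the payload and reads a response. `build_entries_payload` and `best_resolutions` are hypothetical helper names, and the parsing assumes `resolution_combined` comes back as a list of numbers or null.

```python
import json
from typing import Dict, List, Optional


def build_entries_payload(entry_ids: List[str]) -> Dict[str, str]:
    """Build a GraphQL payload that requests title and resolution per entry."""
    ids = ", ".join(json.dumps(entry_id.upper()) for entry_id in entry_ids)
    query = (
        "{ entries(entry_ids: [%s]) { rcsb_id "
        "struct { title } rcsb_entry_info { resolution_combined } } }" % ids
    )
    return {"query": query}


def best_resolutions(response: dict) -> Dict[str, Optional[float]]:
    """Map each returned entry ID to its smallest reported resolution."""
    result: Dict[str, Optional[float]] = {}
    for entry in (response.get("data") or {}).get("entries") or []:
        info = entry.get("rcsb_entry_info") or {}
        values = info.get("resolution_combined") or []
        result[entry["rcsb_id"]] = min(values) if values else None
    return result
```

The payload can be sent with `requests.post("https://data.rcsb.org/graphql", json=build_entries_payload(["1ATP"]))`, and entries without a resolution (NMR structures, for example) map to `None`.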
Cryo-EM structures have grown past 34,000 over the past few years, with large complexes and membrane proteins increasingly represented. AlphaFold 3 released source code and weights for non-commercial use between November 2024 and February 2025, enabling complex structure prediction for proteins, DNA, RNA, and small molecule ligands. Predicted structures from AlphaFold DB (4.5 million users) are also accessible through RCSB alongside the experimental entries.
DrugBank, ZINC, BindingDB
DrugBank covers 11,891 drug entries — 4,563 approved, 6,231 investigational — plus 1.41 million drug-drug interactions (DrugBank 6.0, 2024). Indications, mechanisms, metabolism, and DDI data are all included, making it the typical starting point for drug repurposing research.
⚠️ As of May 2026, academic dataset downloads are suspended (distribution method update in progress). API access continues.
ZINC22 is a virtual screening resource: approximately 55 billion 2D compounds and 5.9 billion 3D docking-ready compounds (official figures). No REST API — access is via FTP or the web GUI. Primarily used with UCSF DOCK and AutoDock.
BindingDB has 3.2 million protein–small molecule binding affinity records (Ki, IC50, Kd, etc.). The TSV download opens directly in Excel and shows up often in DTI benchmark sets.
UniChem handles ID translation only — ChEMBL ID, PubChem CID, InChIKey, DrugBank ID, and more. Any multi-DB pipeline will need it.
# UniChem 2.0 API (POST form recommended; v1 GET is legacy-compatible)
curl -X POST "https://www.ebi.ac.uk/unichem/api/v1/compounds" \
-H "Content-Type: application/json" \
-d '{"type":"inchikey","compound":"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}'
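In a pipeline, the JSON returned by this call usually needs to be flattened into a lookup from source database to identifier. The sketch below assumes the response contains a `compounds` list whose items carry a `sources` list with `shortName` and `compoundId` fields. That shape is an assumption to verify against a live response, and `sources_by_name` is a hypothetical helper name.

```python
from typing import Dict, List


def sources_by_name(response: dict) -> Dict[str, List[str]]:
    """Group cross-referenced identifiers by the source's short name."""
    mapping: Dict[str, List[str]] = {}
    for compound in response.get("compounds") or []:
        for source in compound.get("sources") or []:
            name = source.get("shortName", "unknown")
            compound_id = source.get("compoundId")
            if compound_id is not None:
                mapping.setdefault(name, []).append(str(compound_id))
    return mapping
```

For aspirin, a response of that shape would yield entries such as `CHEMBL25` under `chembl` and `2244` under `pubchem`, ready to feed into the per-database calls shown elsewhere in this article.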
Japanese Databases (Free Only)
| DB | Contents | API | Bulk DL | License |
|---|---|---|---|---|
| KEGG COMPOUND | Metabolites, drugs, pathway integration | REST (free for academic) | Text format | Academic free / commercial paid |
| SDBS (AIST) | NMR/IR/MS spectra | None | Not available (50/day limit) | Non-commercial free |
| NITE-CHRIP | Japanese regulatory info (CSCL, ISHL, etc.) | None | Partial list | Free |
| Nikkaji RDF (NBDC) | 3.6M compounds from Nikkaji | SPARQL | RDF/TTL only | CC BY 4.0 |
KEGG COMPOUND integrates 19,572 pathway-registered compounds and 12,826 drugs with genomic and disease data. The REST API supports compound lookup, pathway cross-referencing, and BRITE hierarchy traversal. Commercial use requires a license via Pathway Solutions.
# Fetch compound by C number
curl "https://rest.kegg.jp/get/C00031"
# Convert KEGG compound IDs to ChEBI IDs
curl "https://rest.kegg.jp/conv/chebi/compound"
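The `conv` endpoint returns plain text with one tab-separated pair of prefixed identifiers per line. A minimal parser might look like the sketch below. `parse_conv` is a hypothetical helper name, and the sample lines in the usage note are illustrative, not captured output.

```python
from typing import Dict, List


def parse_conv(text: str) -> Dict[str, List[str]]:
    """Parse KEGG conv output into {source_id: [target_ids]} without prefixes."""
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        source = parts[0].split(":", 1)[-1]  # "cpd:C00031" -> "C00031"
        target = parts[1].split(":", 1)[-1]  # "chebi:4167" -> "4167"
        mapping.setdefault(source, []).append(target)
    return mapping
```

Given text such as `"cpd:C00031\tchebi:4167\n"`, the function returns `{"C00031": ["4167"]}`. Values are lists because one KEGG compound can map to more than one ChEBI ID.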
SDBS (AIST) covers approximately 34,600 compounds with manually curated spectral data (FT-IR, EI-MS, ¹H NMR, ¹³C NMR). High reliability due to expert curation, but no API and a 50-spectra-per-day download limit.
NITE-CHRIP provides cross-searchable regulatory information for approximately 300,000 substances under Japanese law (CSCL, ISHL, Poisonous and Deleterious Substances Control Act, etc.). Updated every two months. GHS classification lists are partially downloadable; the Excel format is sold separately by CIRS Group.
Nikkaji RDF (NBDC) is the RDF release of the Japanese Chemical Substance Dictionary (Nikkaji). Over 3.6 million compounds, CC BY 4.0 — the only Japanese DB that allows commercial bulk download. SPARQL access required.
Regulatory Databases by Country
Chemical substance inventories and regulatory DBs from various jurisdictions.
| DB | Region | Coverage | API | CSV/DL | English |
|---|---|---|---|---|---|
| ECHA CHEM | EU | C&L: 350K substances | Not available | ✓ (Open Data Portal) | Yes |
| eChemPortal | OECD | Multi-country | Unconfirmed | — | Yes |
| EFSA OpenFoodTox | EU | 7,880 substances | Yes | ✓ (Zenodo) | Yes |
| NCIS | Korea | Full KECL | Unconfirmed | — | Yes |
| CCISS | China (3rd-party) | IECSC 47K substances | None | — | Yes |
| DSL | Canada | 28K substances | None | ✓ (CSV/XLSX) | Yes |
| AIIC | Australia | 40K substances | None | ✓ (spreadsheet) | Yes |
| NITE-CHRIP | Japan | 300K substances | None | Partial | Yes |
EU: ECHA CHEM
ECHA CHEM is run by the European Chemicals Agency and is the largest of the public regulatory chemical DBs covered here. It covers the C&L Inventory (4,400+ harmonized classifications, 7 million+ industry self-classifications) and REACH registration data (physicochemical, toxicological, and ecotoxicological test results). Relaunched in January 2024, with the C&L module integrated in May 2025. Bulk data is available from the ECHA Open Data Portal.
An official REST API is not yet available (announced as a future feature).
eChemPortal (OECD) provides a single search interface across regulatory DBs from ECHA REACH, the US EPA, Japan's NITE-CHRIP, and others. It covers 27,000 REACH-registered substances and over 1.3 million endpoint records.
EFSA OpenFoodTox is a toxicological evaluation dataset for 7,880 food-relevant substances (food additives, pesticides, contaminants, etc.). Accessible via the Open EFSA API and downloadable through Zenodo.
China: CCISS / IECSC
China's chemical inventory is the IECSC (现有化学物质名录), maintained by MEE (Ministry of Ecology and Environment). It covers 47,000 substances, but the official site is Chinese-only.
The practical entry point is CCISS (a free tool by CIRS Group), which provides an English-language interface for searching IECSC by CAS number or English name. It is not official, but update frequency and accuracy are reliable. For anyone who cannot read Chinese, CCISS is effectively the only access point.
Korea: NCIS
NCIS (National Chemical Information System) is the official DB run by the National Institute of Environmental Research (NIER). It covers KECL (Korea Existing Chemicals List) for K-REACH compliance, hazardous chemical lists, and GHS classification data. It is one of the few Asian government chemical DBs with an English interface.
Canada and Australia
Canada's DSL (Domestic Substances List, 28,000 substances) is available as CSV/XLSX from the Open Government Portal.
Australia's AIIC (Australian Inventory of Industrial Chemicals, 40,000 substances), maintained by AICIS, is published in spreadsheet format twice a year. Neither has an API; bulk download is the only programmatic option.
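With download-only inventories, programmatic access amounts to loading the file and indexing it locally. The sketch below uses only the standard library. The column names `CAS RN` and `Name` and the sample rows are hypothetical, so check the header of the file you actually download.

```python
import csv
import io
from typing import Dict


def index_by_cas(csv_text: str, cas_column: str = "CAS RN") -> Dict[str, dict]:
    """Index inventory rows by CAS number; rows without one are skipped."""
    reader = csv.DictReader(io.StringIO(csv_text))
    index: Dict[str, dict] = {}
    for row in reader:
        cas = (row.get(cas_column) or "").strip()
        if cas:
            index[cas] = row
    return index


# Hypothetical two-row extract, for illustration only
sample = "CAS RN,Name\n64-17-5,Ethanol\n,Unnamed polymer\n"
inventory = index_by_cas(sample)
print("64-17-5" in inventory)
```

For a real file, read it with `open(path, newline="", encoding="utf-8")` and pass the text in, or adapt the function to take a file object. XLSX downloads need a spreadsheet reader instead of `csv`.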
Snowflake Marketplace
There are no free chemical structure DB listings.
Major chemical structure DBs such as PubChem and ChEMBL are not available as Data Shares on the Snowflake Marketplace. Commercial listings like IQVIA (clinical and prescription data) and DrugPatentWatch (patent data) exist, but they do not contain molecular structure data.
Snowflake works as a platform rather than a data source for chemistry:
- RDKit can be installed as a Snowpark Python UDF, making fingerprint calculation and similarity search available as SQL
- ChEMBL and PubChem data are loaded via FTP download and Snowpark ingestion as the standard workflow
- AWS terminated its S3 hosting of ChEMBL, so FTP is now the standard retrieval method
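# RDKit via Snowpark Python UDF (registration needs an active Snowpark session)
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType
# packages=["rdkit"] asks Snowflake to make RDKit available inside the UDF
@udf(return_type=StringType(), input_types=[StringType()], packages=["rdkit"])
def canonical_smiles(smiles: str) -> str:
    from rdkit import Chem
    # NULL or unparseable input yields NULL instead of raising
    mol = Chem.MolFromSmiles(smiles) if smiles else None
    return Chem.MolToSmiles(mol) if mol else None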
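The similarity search mentioned in the list above typically reduces to a Tanimoto coefficient over fingerprint bits. The sketch below shows that arithmetic in plain Python, with sets of on-bit indices standing in for RDKit fingerprints. `tanimoto` and `rank_by_similarity` are hypothetical names, and a real UDF would compute the bits with RDKit.

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Shared on-bits divided by total distinct on-bits (0.0 if both are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(
    query: Set[int], library: Dict[str, Set[int]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k library entries most similar to the query."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]
```

Two fingerprints with on-bits `{1, 2, 3}` and `{2, 3, 4}` share two of four distinct bits, so their coefficient is 0.5.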
As of May 2026, neither PubChem nor ChEMBL appears to be offered as a Snowflake Data Share.
CAS Number Search
CAS numbers (e.g., 50-78-2 for aspirin) are the practical cross-DB identifier. API support varies by database.
| DB | CAS Search | Method |
|---|---|---|
| PubChem | ✓ | REST API (treated as compound name) |
| ChEMBL | ✓ | Python SDK (synonym filter) |
| KEGG | ✓ | REST API |
| UniChem | ✓ | POST API (cross-reference) |
| ECHA CHEM | ✓ | Web UI only (no API) |
| NITE-CHRIP | ✓ | Web UI only |
| NCIS (Korea) | ✓ | Web UI only |
| CCISS (China) | ✓ | Web UI only |
| DSL (Canada) | ✓ | Web UI + CSV download |
| AIIC (Australia) | ✓ | Web UI + spreadsheet |
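Before sending a CAS number to any of these services, it is cheap to validate its check digit locally: the last digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The function name `is_valid_cas` below is a hypothetical helper.

```python
import re

# 2 to 7 digits, then 2 digits, then 1 check digit
_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = _CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))
```

For 50-78-2 the sum is 8×1 + 7×2 + 0×3 + 5×4 = 42, and 42 mod 10 is 2, which matches the final digit. A passing check only shows the number is well formed, not that it is assigned to the compound you expect.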
PubChem
CAS numbers can be passed directly as compound names.
# CAS number → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/cids/JSON"
# CAS number → SMILES, molecular weight, and formula in one request
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/property/MolecularFormula,MolecularWeight,IsomericSMILES/JSON"
ChEMBL
Where ChEMBL records a CAS number, it appears among the molecule's synonyms, so coverage is not guaranteed and an empty result is possible.
from chembl_webresource_client.new_client import new_client
molecule = new_client.molecule
# Match the synonym text exactly, ignoring case
results = list(molecule.filter(molecule_synonyms__molecule_synonym__iexact='50-78-2'))
if results:
    m = results[0]
    print(m['molecule_chembl_id'])
    print(m['molecule_structures']['canonical_smiles'])
KEGG
The KEGG REST API accepts CAS numbers directly in the find endpoint.
# Search KEGG compounds by CAS number
curl "https://rest.kegg.jp/find/compound/50-78-2"
# Returns tab-separated lines; entry IDs carry a cpd: prefix (e.g. cpd:C01405)
# Then fetch details by C number
curl "https://rest.kegg.jp/get/C01405"
The find endpoint also accepts compound names, molecular formulas, and molecular weight ranges.
Bulk CAS → SMILES via PubChem
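For pipeline use, note that the PUG REST name namespace takes a single name per request, so batch conversion means looping over the list and throttling to stay within the rate limit.
import time
import requests
BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound"
cas_list = ['50-78-2', '64-17-5', '7732-18-5']  # aspirin, ethanol, water
for cas in cas_list:
    # One name per request; POST keeps unusual characters out of the URL
    resp = requests.post(f"{BASE}/name/property/IsomericSMILES/JSON", data={'name': cas}, timeout=30)
    for prop in resp.json().get('PropertyTable', {}).get('Properties', []):
        # The response may label the field IsomericSMILES or SMILES
        smiles = prop.get('IsomericSMILES') or prop.get('SMILES')
        print(cas, prop['CID'], smiles)
    time.sleep(0.25)  # stay under 5 requests/sec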
PubChem treats CAS numbers as chemical names. One CAS number can map to multiple CIDs (salts, hydrates, stereoisomers) — if you get multiple results, take the first CID or filter by InChIKey.
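The selection rule in the previous paragraph can be sketched as a small function. It assumes each record is a dict with `CID` and `InChIKey` keys, as returned when `InChIKey` is included in the requested properties. `pick_cid` is a hypothetical name, and the second sample record is made up.

```python
from typing import List, Optional


def pick_cid(records: List[dict], inchikey: Optional[str] = None) -> Optional[int]:
    """Pick the CID matching a known InChIKey, else fall back to the first CID."""
    if not records:
        return None
    if inchikey:
        for record in records:
            if record.get("InChIKey") == inchikey:
                return record["CID"]
        return None  # a key was given but nothing matched: do not guess
    return records[0]["CID"]


records = [
    {"CID": 999999, "InChIKey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},  # made-up record
    {"CID": 2244, "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},    # aspirin
]
print(pick_cid(records, "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

Returning `None` when a supplied InChIKey matches nothing is a deliberate choice here: silently falling back to the first CID would hide exactly the salt or stereoisomer mix-ups the filter is meant to catch.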
Data Quality
A 2025 EPA paper flagged the propagation of incorrect CAS numbers and stereochemical information across multiple databases. Aggregators such as PubChem merge records from many sources, so an error in one source can resurface in others. For publications or regulatory submissions, check the primary source: literature or experimental data.
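A cheap first screen for this kind of propagated error is to compare what each database returns for the same CAS number and flag disagreements for manual review. The sketch below assumes you have already collected `(source, cas, inchikey)` tuples. `find_conflicts` is a hypothetical name and the sample keys are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_conflicts(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Return CAS numbers whose sources disagree on the InChIKey."""
    by_cas: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for source, cas, inchikey in records:
        by_cas[cas].append((source, inchikey))
    return {
        cas: pairs
        for cas, pairs in by_cas.items()
        if len({inchikey for _, inchikey in pairs}) > 1
    }
```

A flagged CAS number is not necessarily an error, since salts and stereoisomers can legitimately differ between sources, but it marks the records worth tracing back to the primary literature.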
Sources
- PubChem 2025 update | Nucleic Acids Research
- ChEMBL 35 is out | ChEMBL Blog
- ChEMBL 36 is live | EMBL-EBI
- ZINC-22 | PMC
- DrugBank 6.0 | Nucleic Acids Research
- BindingDB in 2024 | PMC
- RCSB PDB 2025 milestone | RCSB
- AlphaFold Protein Structure Database 2025 | PMC
- KEGG API Manual
- NITE-CHRIP | NITE
- ECHA CHEM
- eChemPortal | OECD
- EFSA OpenFoodTox
- NCIS | NIER Korea
- CCISS | CIRS Group
- DSL | Canada
- AIIC | AICIS Australia
- DrugPatentWatch on Snowflake Marketplace
- Cheminformatics in Snowflake | Medium