DEV Community

kent-tokyo


Cheminformatics Databases in 2026: PubChem, ChEMBL, Regulatory Inventories, and API Access

When you try to pull chemical structure or bioactivity data via API, every database has its own endpoint design, its own license terms, and its own coverage. This post surveys research DBs, national regulatory inventories, and Snowflake as of May 2026.


Research DB Overview

| DB | Contents | API | Bulk Download | License |
|---|---|---|---|---|
| PubChem | Compounds, bioactivity, toxicity, patents | REST | ✓ (FTP / PUG Download) | Public domain |
| ChEMBL | Drug bioactivity | REST + Python SDK | ✓ (TSV / SDF / SQL, FTP) | CC BY-SA 3.0 |
| RCSB PDB | Protein 3D structures | GraphQL + REST | ✓ (FTP / rsync) | CC0 |
| DrugBank | Drug info, DDI | REST (registration required) | Suspended | CC BY-NC 4.0 |
| ZINC | Purchasable compounds, 3D | None | ✓ (SDF / SMILES, FTP) | Free |
| BindingDB | Binding affinity (Ki, IC50, etc.) | Limited REST | ✓ (TSV, Excel-readable) | CC BY 4.0 |
| UniChem | Cross-DB ID translation | REST | | Free |

PubChem: The First Stop

PubChem (NCBI) holds 119 million compounds and 295 million bioactivity records (NAR 2025), aggregating from over 1,000 sources including ChEMBL and DrugBank. Public domain.

# Compound name → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/JSON"
# CID → SMILES (2244 is the CID for aspirin)
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/IsomericSMILES/JSON"
# InChIKey → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/cids/JSON"

Rate limit: 5 requests/sec, 400 requests/min. For tens of thousands of records or more, use PUG Download (up to 250,000 records per request) or FTP.
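
The per-second limit above works out to a minimum spacing of 0.2 s between requests. A minimal sketch of a client-side throttle follows. The `Throttle` class and its parameters are illustrative names, not part of any PubChem library, and the sketch only enforces the per-second limit (5 requests/sec sustained is 300/min, which stays under the 400/min cap).

```python
import time


class Throttle:
    """Space out calls so that at most `per_second` happen each second."""

    def __init__(self, per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one `Throttle()` per process and call `wait()` immediately before each PUG REST request in a loop.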

PubChem aggregates from many sources, so conflicting values for the same compound are common. For IC50 data specifically, ChEMBL's curated records are more reliable.


ChEMBL: The Standard for ML Data

ChEMBL (EMBL-EBI) is a manually curated bioactivity database extracted from medicinal chemistry literature. It includes unit normalization and activity classification for IC50, Ki, and EC50 values, and has become the standard training data source for ML models.

As of ChEMBL 36 (September 2025): 2.878 million compounds, 24.27 million activity records.

from chembl_webresource_client.new_client import new_client

# Fetch one molecule by ChEMBL ID (CHEMBL25 is aspirin)
molecule = new_client.molecule
m = molecule.get('CHEMBL25')
print(m['molecule_structures']['canonical_smiles'])

# First five IC50 records for one target (CHEMBL205 is carbonic anhydrase II)
activity = new_client.activity
for a in activity.filter(target_chembl_id='CHEMBL205', standard_type='IC50')[:5]:
    print(a['molecule_chembl_id'], a['standard_value'], a['standard_units'])

Full DB downloads are available via FTP in SQLite, MySQL, and PostgreSQL formats. The chembl-downloader package (pip install chembl-downloader) handles reproducible retrieval.

ChEMBL is licensed under CC BY-SA 3.0, so derivatives must be released under the same license.
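
For ML use, IC50 values are usually converted to a log scale before training. The sketch below applies the common convention pIC50 = 9 − log10(IC50 in nM) to activity records shaped like the ones the client returns. The `to_pic50` helper and its filtering rules (nM only, exact measurements only) are assumptions made for illustration, not something ChEMBL prescribes; ChEMBL records also carry their own `pchembl_value` field, which may be preferable when it is populated.

```python
import math


def to_pic50(record):
    """Return pIC50 for a ChEMBL-style activity dict, or None if unusable."""
    if record.get("standard_type") != "IC50":
        return None
    if record.get("standard_units") != "nM":
        return None  # skip other units rather than guess a conversion
    if record.get("standard_relation") not in (None, "="):
        return None  # drop censored values such as ">" or "<"
    try:
        value = float(record["standard_value"])
    except (KeyError, TypeError, ValueError):
        return None
    if value <= 0:
        return None
    return 9.0 - math.log10(value)  # nM -> M, then negative log10
```

A 1000 nM (1 µM) record maps to a pIC50 of 6, and records in other units or with inequality relations are skipped instead of being silently coerced.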


RCSB PDB: Protein 3D Structures

RCSB PDB holds 254,000 experimentally determined structures (204,000 X-ray, 34,000 cryo-EM, 14,000 NMR). Fully open (CC0), with complete bulk download via FTP.

The GraphQL API lets you fetch structure info, ligands, and citations in a single request.

import requests
query = """{ entry(entry_id: "1ATP") {
  struct { title }
  rcsb_entry_info { resolution_combined }
} }"""
resp = requests.post("https://data.rcsb.org/graphql", json={"query": query})
print(resp.json()["data"]["entry"]["struct"]["title"])
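
The same endpoint can return several entries in one request through the `entries` root field. The sketch below only builds the query string; `build_entries_query` is a hypothetical helper name, and the selected fields mirror the single-entry example above.

```python
import json


def build_entries_query(entry_ids):
    """Build a GraphQL query that fetches several PDB entries at once."""
    # A JSON list of strings is also a valid GraphQL list literal for plain IDs.
    ids = json.dumps([entry_id.upper() for entry_id in entry_ids])
    return (
        "{ entries(entry_ids: %s) {\n"
        "  rcsb_id\n"
        "  struct { title }\n"
        "  rcsb_entry_info { resolution_combined }\n"
        "} }" % ids
    )
```

Pass the result as the `query` value in the same `requests.post("https://data.rcsb.org/graphql", json={"query": ...})` call; the response then holds a list under `data.entries` rather than a single `data.entry`.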

Cryo-EM structures have grown past 34,000 over the past few years, with large complexes and membrane proteins increasingly represented. AlphaFold 3's source code and weights were released for non-commercial use between November 2024 and February 2025, enabling complex structure prediction for proteins, DNA, RNA, and small molecule ligands. Predicted structures from AlphaFold DB (4.5 million users) are also accessible through RCSB, alongside the experimental entries.


DrugBank, ZINC, BindingDB

DrugBank covers 11,891 drug entries — 4,563 approved, 6,231 investigational — plus 1.41 million drug-drug interactions (DrugBank 6.0, 2024). Indications, mechanisms, metabolism, and DDI data are all included, making it the typical starting point for drug repurposing research.

⚠️ As of May 2026, academic dataset downloads are suspended (distribution method update in progress). API access continues.

ZINC22 is a virtual screening resource: approximately 55 billion 2D compounds and 5.9 billion 3D docking-ready compounds (official figures). No REST API — access is via FTP or the web GUI. Primarily used with UCSF DOCK and AutoDock.

BindingDB has 3.2 million protein–small molecule binding affinity records (Ki, IC50, Kd, etc.). The TSV download opens directly in Excel and shows up often in DTI benchmark sets.

UniChem handles ID translation only — ChEMBL ID, PubChem CID, InChIKey, DrugBank ID, and more. Any multi-DB pipeline will need it.

# UniChem 2.0 API (POST form recommended; v1 GET is legacy-compatible)
curl -X POST "https://www.ebi.ac.uk/unichem/api/v1/compounds" \
  -H "Content-Type: application/json" \
  -d '{"type":"inchikey","compound":"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}'
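
In a Python pipeline the same POST can be issued with the standard library. This is a sketch under assumptions: the response is taken to contain a `compounds` list whose items hold a `sources` list with `shortName` and `compoundId` keys, so check those key names against a live response before relying on them. `lookup_inchikey` and `source_ids` are illustrative helper names.

```python
import json
import urllib.request

UNICHEM_URL = "https://www.ebi.ac.uk/unichem/api/v1/compounds"


def lookup_inchikey(inchikey):
    """POST an InChIKey to UniChem and return the decoded JSON response."""
    body = json.dumps({"type": "inchikey", "compound": inchikey}).encode("utf-8")
    request = urllib.request.Request(
        UNICHEM_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def source_ids(payload):
    """Map each source short name to its compound IDs (assumed response shape)."""
    mapping = {}
    for compound in payload.get("compounds", []):
        for source in compound.get("sources", []):
            mapping.setdefault(source["shortName"], []).append(source["compoundId"])
    return mapping
```

Calling `source_ids(lookup_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))` would then give one dictionary keyed by database, which is convenient for joining ChEMBL and PubChem tables.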

Japanese Databases (Free Only)

| DB | Contents | API | Bulk DL | License |
|---|---|---|---|---|
| KEGG COMPOUND | Metabolites, drugs, pathway integration | REST (free for academic) | Text format | Academic free / commercial paid |
| SDBS (AIST) | NMR/IR/MS spectra | None | Not available (50/day limit) | Non-commercial free |
| NITE-CHRIP | Japanese regulatory info (CSCL, ISHL, etc.) | None | Partial list | Free |
| Nikkaji RDF (NBDC) | 3.6M compounds from Nikkaji | Bulk DL (SPARQL) | RDF/TTL only | CC BY 4.0 |

KEGG COMPOUND integrates 19,572 pathway-registered compounds and 12,826 drugs with genomic and disease data. The REST API supports compound lookup, pathway cross-referencing, and BRITE hierarchy traversal. Commercial use requires a license via Pathway Solutions.

# Fetch compound by C number
curl "https://rest.kegg.jp/get/C00031"
# Convert KEGG compound IDs to ChEBI IDs
curl "https://rest.kegg.jp/conv/chebi/compound"
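
The `conv` endpoint returns plain text with one tab-separated pair per line, each ID carrying a database prefix (for example `cpd:C00031` and `chebi:4167`). A minimal sketch of turning that text into a lookup table follows; `parse_kegg_conv` is a hypothetical helper, not part of any KEGG client.

```python
def parse_kegg_conv(text):
    """Parse KEGG `conv` output into {kegg_id: [external_ids]}."""
    mapping = {}
    for line in text.splitlines():
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        kegg_id = parts[0].split(":", 1)[-1]  # "cpd:C00031" -> "C00031"
        ext_id = parts[1].split(":", 1)[-1]   # "chebi:4167" -> "4167"
        mapping.setdefault(kegg_id, []).append(ext_id)
    return mapping
```

Values are lists because one KEGG compound can map to more than one external ID.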

SDBS (AIST) covers approximately 34,600 compounds with manually curated spectral data (FT-IR, EI-MS, ¹H NMR, ¹³C NMR). High reliability due to expert curation, but no API and a 50-spectra-per-day download limit.

NITE-CHRIP provides cross-searchable regulatory information for approximately 300,000 substances under Japanese law (CSCL, ISHL, Poisonous and Deleterious Substances Control Act, etc.). Updated every two months. GHS classification lists are partially downloadable; the Excel format is sold separately by CIRS Group.

Nikkaji RDF (NBDC) is the RDF release of the Japanese Chemical Substance Dictionary (Nikkaji). Over 3.6 million compounds, CC BY 4.0 — the only Japanese DB that allows commercial bulk download. SPARQL access required.


Regulatory Databases by Country

Chemical substance inventories and regulatory DBs from various jurisdictions.

| DB | Region | Coverage | API | CSV/DL | English |
|---|---|---|---|---|---|
| ECHA CHEM | EU | C&L: 350K substances | Not available | ✓ (Open Data Portal) | Yes |
| eChemPortal | OECD | Multi-country | Unconfirmed | | Yes |
| EFSA OpenFoodTox | EU | 7,880 substances | Yes | ✓ (Zenodo) | Yes |
| NCIS | Korea | Full KECL | Unconfirmed | | Yes |
| CCISS | China (3rd-party) | IECSC 47K substances | None | | Yes |
| DSL | Canada | 28K substances | None | ✓ (CSV/XLSX) | Yes |
| AIIC | Australia | 40K substances | None | ✓ (spreadsheet) | Yes |
| NITE-CHRIP | Japan | 300K substances | None | Partial | Yes |

EU: ECHA CHEM

ECHA CHEM is run by the European Chemicals Agency and is the largest public chemical DB. It covers the C&L Inventory (4,400+ harmonized classifications, 7 million+ industry self-classifications) and REACH registration data (physicochemical, toxicological, and ecotoxicological test results). Relaunched in January 2024, with the C&L module integrated in May 2025. Bulk data is available from the ECHA Open Data Portal.

An official REST API is not yet available (announced as a future feature).

eChemPortal (OECD) provides a single search interface across regulatory DBs from ECHA REACH, the US EPA, Japan's NITE-CHRIP, and others. It covers 27,000 REACH-registered substances and over 1.3 million endpoint records.

EFSA OpenFoodTox is a toxicological evaluation dataset for 7,880 food-relevant substances (food additives, pesticides, contaminants, etc.). Accessible via the Open EFSA API and downloadable through Zenodo.

China: CCISS / IECSC

China's chemical inventory is the IECSC (现有化学物质名录), maintained by MEE (Ministry of Ecology and Environment). It covers 47,000 substances, but the official site is Chinese-only.

The practical entry point is CCISS (a free tool by CIRS Group), which provides an English-language interface for searching IECSC by CAS number or English name. It is not official, but update frequency and accuracy are reliable. For anyone who cannot read Chinese, CCISS is effectively the only access point.

Korea: NCIS

NCIS (National Chemical Information System) is the official DB run by the National Institute of Environmental Research (NIER). It covers KECL (Korea Existing Chemicals List) for K-REACH compliance, hazardous chemical lists, and GHS classification data. It is one of the few Asian government chemical DBs with an English interface.

Canada and Australia

Canada's DSL (Domestic Substances List, 28,000 substances) is available as CSV/XLSX from the Open Government Portal.

Australia's AIIC (Australian Inventory of Industrial Chemicals, 40,000 substances), maintained by AICIS, is published in spreadsheet format twice a year. Neither has an API; bulk download is the only programmatic option.
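
Since bulk download is the only programmatic route for these inventories, a lookup usually means indexing the downloaded file locally. The sketch below indexes a CSV by CAS number. The column names (`CAS RN`, `Substance name`) are hypothetical placeholders, so check the header row of the actual DSL or AIIC file and pass the real column name.

```python
import csv
import io


def index_by_cas(csv_file, cas_column):
    """Read an inventory CSV and return {CAS number: row dict}."""
    reader = csv.DictReader(csv_file)
    index = {}
    for row in reader:
        cas = (row.get(cas_column) or "").strip()
        if cas:  # some inventory rows have no CAS number
            index[cas] = row
    return index


# Example with an in-memory file; the column names are placeholders.
sample = io.StringIO(
    "CAS RN,Substance name\n"
    "50-78-2,Acetylsalicylic acid\n"
    "64-17-5,Ethanol\n"
)
inventory = index_by_cas(sample, "CAS RN")
print("50-78-2" in inventory)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="", encoding="utf-8")`.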


Snowflake Marketplace

There are no free chemical structure DB listings.

Major chemical structure DBs such as PubChem and ChEMBL are not available as Data Shares on the Snowflake Marketplace. Commercial listings like IQVIA (clinical and prescription data) and DrugPatentWatch (patent data) exist, but they do not contain molecular structure data.

Snowflake works as a platform rather than a data source for chemistry:

  • RDKit can be installed as a Snowpark Python UDF, making fingerprint calculation and similarity search available as SQL
  • ChEMBL and PubChem data are loaded via FTP download and Snowpark ingestion as the standard workflow
  • AWS terminated its S3 hosting of ChEMBL, so FTP is now the standard retrieval method
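
# RDKit via Snowpark Python UDF (requires an active Snowpark session)
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType

# packages=["rdkit"] assumes RDKit is available in your account's package channel
@udf(return_type=StringType(), input_types=[StringType()], packages=["rdkit"])
def canonical_smiles(smiles: str) -> str:
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    # Return NULL for SMILES that RDKit cannot parse
    return Chem.MolToSmiles(mol) if mol else None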

As of May 2026, there is no indication that PubChem or ChEMBL are participating as Snowflake Data Shares.


CAS Number Search

CAS numbers (e.g., 50-78-2 for aspirin) are the practical cross-DB identifier. API support varies by database.

| DB | CAS Search Method |
|---|---|
| PubChem | REST API (treated as compound name) |
| ChEMBL | Python SDK (synonym filter) |
| KEGG | REST API |
| UniChem | POST API (cross-reference) |
| ECHA CHEM | Web UI only (no API) |
| NITE-CHRIP | Web UI only |
| NCIS (Korea) | Web UI only |
| CCISS (China) | Web UI only |
| DSL (Canada) | Web UI + CSV download |
| AIIC (Australia) | Web UI + spreadsheet |

PubChem

CAS numbers can be passed directly as compound names.

# CAS number → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/cids/JSON"

# CAS number → SMILES, molecular weight, and formula in one request
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/property/MolecularFormula,MolecularWeight,IsomericSMILES/JSON"

ChEMBL

CAS numbers are stored in ChEMBL's synonyms table.

from chembl_webresource_client.new_client import new_client

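molecule = new_client.molecule
# The filterable synonym field is `molecule_synonym`; CAS coverage in synonyms
# is not guaranteed, so an empty result does not prove the compound is absent.
results = list(molecule.filter(molecule_synonyms__molecule_synonym__iexact='50-78-2'))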
if results:
    m = results[0]
    print(m['molecule_chembl_id'])
    print(m['molecule_structures']['canonical_smiles'])

KEGG

The KEGG REST API accepts CAS numbers directly in the find endpoint.

# Search KEGG compounds by CAS number
curl "https://rest.kegg.jp/find/compound/50-78-2"

# Returns TSV like: C01405\t50-78-2
# Then fetch details by C number
curl "https://rest.kegg.jp/get/C01405"

The find endpoint also accepts compound names, molecular formulas, and molecular weight ranges.

Bulk CAS → SMILES via PubChem

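For pipeline use, loop over the CAS numbers with one request each. PUG REST accepts only a single name per request (lists are supported for CIDs, not names), so the loop has to stay under the rate limit.

import time
import requests

cas_list = ['50-78-2', '64-17-5', '7732-18-5']  # aspirin, ethanol, water
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/property/IsomericSMILES/JSON"

for cas in cas_list:
    # POST keeps the name out of the URL path; one name per request
    resp = requests.post(url, data={'name': cas}, timeout=30)
    # An unknown name yields a Fault object, so Properties is simply empty
    for prop in resp.json().get('PropertyTable', {}).get('Properties', []):
        # Newer responses may return the value under the key "SMILES"
        smiles = prop.get('IsomericSMILES') or prop.get('SMILES')
        print(cas, prop['CID'], smiles)
    time.sleep(0.25)  # stay under 5 requests/sec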

PubChem treats CAS numbers as chemical names. One CAS number can map to multiple CIDs (salts, hydrates, stereoisomers) — if you get multiple results, take the first CID or filter by InChIKey.
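
The choice between "first CID" and "filter by InChIKey" can be sketched as a small helper over the `Properties` list of a PUG REST response. It assumes the request asked for the `InChIKey` property alongside the others (for example `.../property/InChIKey,IsomericSMILES/JSON`); `pick_cid` is an illustrative name.

```python
def pick_cid(properties, inchikey=None):
    """Choose one CID from a PUG REST `Properties` list.

    With `inchikey`, return the CID whose InChIKey matches exactly, or None
    if nothing matches. Without it, fall back to the first entry.
    """
    if not properties:
        return None
    if inchikey is None:
        return properties[0]["CID"]
    for prop in properties:
        if prop.get("InChIKey") == inchikey:
            return prop["CID"]
    return None
```

Returning `None` on a failed InChIKey match, rather than falling back to the first CID, keeps a salt or hydrate from being substituted silently.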


Data Quality

A 2025 EPA paper flagged the propagation of incorrect CAS numbers and stereochemical information across multiple databases. PubChem aggregates from many sources, so conflicting values between them are common. For publications or regulatory submissions, check the primary source — literature or experimental data.
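
One inexpensive guard is to validate the CAS check digit before any lookup. The last digit of a CAS number equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below implements that rule; it catches typos and transpositions, but not a well-formed CAS number attached to the wrong structure.

```python
import re

_CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas):
    """Return True if `cas` has the CAS format and a correct check digit."""
    match = _CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... from right to left, then take the sum mod 10.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))
```

For aspirin (50-78-2) the sum is 8×1 + 7×2 + 0×3 + 5×4 = 42, and 42 mod 10 = 2 matches the final digit.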


