kent-tokyo

Posted on May 27

Cheminformatics Databases in 2026: PubChem, ChEMBL, Regulatory Inventories, and API Access

#cheminformatics #chemistry #api #bioinformatics

When you pull chemical structure or bioactivity data via API, every database has its own endpoint design, license terms, and coverage. This post surveys research DBs, national regulatory inventories, and Snowflake as of May 2026.

Research DB Overview

DB	Contents	API	Bulk Download	License
PubChem	Compounds, bioactivity, toxicity, patents	REST	✓ (FTP / PUG Download)	Public domain
ChEMBL	Drug bioactivity	REST + Python SDK	✓ (TSV / SDF / SQL, FTP)	CC BY-SA 3.0
RCSB PDB	Protein 3D structures	GraphQL + REST	✓ (FTP / rsync)	CC0
DrugBank	Drug info, DDI	REST (registration required)	Suspended	CC BY-NC 4.0
ZINC	Purchasable compounds, 3D	None	✓ (SDF / SMILES, FTP)	Free
BindingDB	Binding affinity (Ki, IC50, etc.)	Limited REST	✓ (TSV, Excel-readable)	CC BY 4.0
UniChem	Cross-DB ID translation	REST	—	Free

PubChem: The First Stop

PubChem (NCBI) holds 119 million compounds and 295 million bioactivity records (NAR 2025), aggregating from over 1,000 sources including ChEMBL and DrugBank. Public domain.

# Compound name → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/JSON"
# CID → SMILES (2244 is the aspirin CID; 2519 is caffeine)
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/property/IsomericSMILES/JSON"
# InChIKey → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/cids/JSON"

Rate limit: 5 requests/sec, 400 requests/min. For tens of thousands of records or more, use PUG Download (up to 250,000 records per request) or FTP.
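To stay under the per-second limit in a script, a small client-side throttle is usually enough. The following is a minimal sketch, not part of any PubChem library: the `Throttle` class and its parameters are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Keeps at least 1/per_second seconds between consecutive calls."""

    def __init__(self, per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


# Usage: call throttle.wait() immediately before each PUG REST request.
throttle = Throttle(per_second=5)
```

Spacing requests 0.2 s apart also keeps the total at 300 per minute, which is inside the 400 requests/min ceiling.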

PubChem aggregates from many sources, so conflicting values for the same compound are common. For IC50 data specifically, ChEMBL's curated records are more reliable.

ChEMBL: The Standard for ML Data

ChEMBL (EMBL-EBI) is a manually curated bioactivity database extracted from medicinal chemistry literature. It includes unit normalization and activity classification for IC50, Ki, and EC50 values, and has become the standard training data source for ML models.

As of ChEMBL 36 (September 2025): 2.878 million compounds, 24.27 million activity records.

from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
m = molecule.get('CHEMBL25')
print(m['molecule_structures']['canonical_smiles'])

activity = new_client.activity
for a in activity.filter(target_chembl_id='CHEMBL205', standard_type='IC50')[:5]:
    print(a['molecule_chembl_id'], a['standard_value'], a['standard_units'])
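For ML use, IC50 values are typically converted to a log scale before training. The sketch below shows the arithmetic on dictionaries shaped like the fields printed above; the helper name `to_pic50` and the sample records are made up for illustration, and only records already expressed in nM are handled.

```python
import math


def to_pic50(record):
    """Return pIC50 for a ChEMBL-style activity dict, or None if unusable.

    Only handles records normalized to nM; anything else is skipped.
    """
    value = record.get("standard_value")
    if value is None or record.get("standard_units") != "nM":
        return None
    nanomolar = float(value)  # values may arrive as strings
    if nanomolar <= 0:
        return None
    return 9.0 - math.log10(nanomolar)  # pIC50 = -log10(IC50 in mol/L)


sample = [  # made-up records for illustration
    {"molecule_chembl_id": "CHEMBL_A", "standard_value": "1000.0", "standard_units": "nM"},
    {"molecule_chembl_id": "CHEMBL_B", "standard_value": None, "standard_units": "nM"},
    {"molecule_chembl_id": "CHEMBL_C", "standard_value": "50", "standard_units": "ug.mL-1"},
]
for rec in sample:
    print(rec["molecule_chembl_id"], to_pic50(rec))
```

ChEMBL activity records also carry a precomputed `pchembl_value` field for many entries, which is worth preferring when it is populated.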

Full DB downloads are available via FTP in SQLite, MySQL, and PostgreSQL formats. The chembl-downloader package (pip install chembl-downloader) handles reproducible retrieval.
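Once the SQLite dump is on disk, queries are plain SQL joins. The sketch below runs against an in-memory toy database so it is self-contained; the table and column names are assumed to mirror a small subset of the ChEMBL schema and should be checked against the schema documentation for the release you download. The activity values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the downloaded ChEMBL SQLite file
conn.executescript("""
CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, chembl_id TEXT);
CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
CREATE TABLE activities (molregno INTEGER, standard_type TEXT,
                         standard_value REAL, standard_units TEXT);
INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL25');
INSERT INTO compound_structures VALUES (1, 'CC(=O)Oc1ccccc1C(=O)O');
INSERT INTO activities VALUES (1, 'IC50', 1000.0, 'nM');  -- made-up value
INSERT INTO activities VALUES (1, 'Ki', 20.0, 'nM');      -- made-up value
""")

# Join activities to identifiers and structures, filtering by activity type.
query = """
SELECT md.chembl_id, cs.canonical_smiles, act.standard_value, act.standard_units
FROM activities AS act
JOIN molecule_dictionary AS md ON md.molregno = act.molregno
JOIN compound_structures AS cs ON cs.molregno = act.molregno
WHERE act.standard_type = ?
"""
rows = conn.execute(query, ("IC50",)).fetchall()
for row in rows:
    print(row)
```

With the real file, only the `sqlite3.connect` argument changes; the parameterized query stays the same.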

CC BY-SA 3.0. Derivatives must be released under the same license.

RCSB PDB: Protein 3D Structures

RCSB PDB holds 254,000 experimentally determined structures (204,000 X-ray, 34,000 cryo-EM, 14,000 NMR). Fully open (CC0), with complete bulk download via FTP.

The GraphQL API lets you fetch structure info, ligands, and citations in a single request.

import requests
query = """{ entry(entry_id: "1ATP") {
  struct { title }
  rcsb_entry_info { resolution_combined }
} }"""
resp = requests.post("https://data.rcsb.org/graphql", json={"query": query})
print(resp.json()["data"]["entry"]["struct"]["title"])
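The same endpoint can return several entries in one request by using the plural `entries` field with a GraphQL variable. The sketch below only builds the request body and parses a response dictionary, so it runs without network access; `build_payload` and `summarize` are hypothetical helper names, and the response shape is assumed to follow the single-entry example above.

```python
import json

BATCH_QUERY = """
query ($ids: [String!]!) {
  entries(entry_ids: $ids) {
    rcsb_id
    struct { title }
    rcsb_entry_info { resolution_combined }
  }
}
"""


def build_payload(entry_ids):
    """Return the JSON body for one batched GraphQL request."""
    return {"query": BATCH_QUERY, "variables": {"ids": [i.upper() for i in entry_ids]}}


def summarize(response_json):
    """Map entry ID -> (title, best resolution or None) from a response dict."""
    out = {}
    for entry in (response_json.get("data") or {}).get("entries") or []:
        if entry is None:
            continue
        info = entry.get("rcsb_entry_info") or {}
        resolutions = info.get("resolution_combined") or []
        title = (entry.get("struct") or {}).get("title")
        out[entry["rcsb_id"]] = (title, min(resolutions) if resolutions else None)
    return out


payload = build_payload(["1atp", "4hhb"])
print(json.dumps(payload["variables"]))
# Send with: requests.post("https://data.rcsb.org/graphql", json=payload)
```

Entries without a resolution (NMR structures, for example) come back with an empty or null list, which is why `summarize` tolerates missing values.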

Cryo-EM structures have grown past 34,000 over the past few years, with large complexes and membrane proteins increasingly represented. The AlphaFold 3 source code and weights were released for non-commercial use between November 2024 and February 2025, enabling complex structure prediction for proteins, DNA, RNA, and small molecule ligands. Predicted structures from AlphaFold DB (4.5 million users) are accessible via RCSB alongside the experimental ones.

DrugBank, ZINC, BindingDB

DrugBank covers 11,891 drug entries — 4,563 approved, 6,231 investigational — plus 1.41 million drug-drug interactions (DrugBank 6.0, 2024). Indications, mechanisms, metabolism, and DDI data are all included, making it the typical starting point for drug repurposing research.

⚠️ As of May 2026, academic dataset downloads are suspended (distribution method update in progress). API access continues.

ZINC22 is a virtual screening resource: approximately 55 billion 2D compounds and 5.9 billion 3D docking-ready compounds (official figures). No REST API — access is via FTP or the web GUI. Primarily used with UCSF DOCK and AutoDock.

BindingDB has 3.2 million protein–small molecule binding affinity records (Ki, IC50, Kd, etc.). The TSV download opens directly in Excel and shows up often in DTI benchmark sets.

UniChem handles ID translation only — ChEMBL ID, PubChem CID, InChIKey, DrugBank ID, and more. Any multi-DB pipeline will need it.

# UniChem 2.0 API (POST form recommended; v1 GET is legacy-compatible)
curl -X POST "https://www.ebi.ac.uk/unichem/api/v1/compounds" \
  -H "Content-Type: application/json" \
  -d '{"type":"inchikey","compound":"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}'
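Because a malformed key only produces an empty or error response, it can help to validate InChIKeys before sending them. The sketch below checks the standard 14-10-1 letter layout and builds the same JSON body as the curl call above; the helper names are hypothetical and nothing here contacts UniChem.

```python
import re

# Standard InChIKey layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def is_inchikey(text):
    """True if the text has the shape of a standard InChIKey."""
    return INCHIKEY_RE.fullmatch(text.strip()) is not None


def connectivity_block(inchikey):
    """First 14 letters; shared by compounds that differ only in stereochemistry."""
    return inchikey.strip().split("-")[0]


def unichem_payload(inchikey):
    """Build the POST body used in the curl example, rejecting malformed keys."""
    key = inchikey.strip().upper()
    if not is_inchikey(key):
        raise ValueError(f"not an InChIKey: {inchikey!r}")
    return {"type": "inchikey", "compound": key}


print(unichem_payload("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
print(connectivity_block("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
```

The shape check does not prove the key corresponds to a real compound; it only filters out typos and identifiers of the wrong kind, such as CAS numbers.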

Japanese Databases (Free Only)

DB	Contents	API	Bulk DL	License
KEGG COMPOUND	Metabolites, drugs, pathway integration	REST (free for academic)	Text format	Academic free / commercial paid
SDBS (AIST)	NMR/IR/MS spectra	None	Not available (50/day limit)	Non-commercial free
NITE-CHRIP	Japanese regulatory info (CSCL, ISHL, etc.)	None	Partial list	Free
Nikkaji RDF (NBDC)	3.6M compounds from Nikkaji	SPARQL	✓ (RDF/TTL only)	CC BY 4.0

KEGG COMPOUND integrates 19,572 pathway-registered compounds and 12,826 drugs with genomic and disease data. The REST API supports compound lookup, pathway cross-referencing, and BRITE hierarchy traversal. Commercial use requires a license via Pathway Solutions.

# Fetch compound by C number
curl "https://rest.kegg.jp/get/C00031"
# Convert KEGG compound IDs to ChEBI IDs
curl "https://rest.kegg.jp/conv/chebi/compound"
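The conv endpoint returns tab-separated text rather than JSON, so a small parser is needed before the mapping can be joined to other data. The sketch below assumes each line holds one prefixed KEGG ID and one prefixed ChEBI ID, and identifies the columns by prefix rather than by position; the sample lines are illustrative, not fetched output, and `parse_conv` is a hypothetical helper name.

```python
from collections import defaultdict


def parse_conv(text):
    """Parse tab-separated conv output into {KEGG compound ID: [ChEBI IDs]}."""
    mapping = defaultdict(list)
    for line in text.splitlines():
        parts = line.strip().split("\t")
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        kegg = next((p for p in parts if p.startswith("cpd:")), None)
        chebi = next((p for p in parts if p.startswith("chebi:")), None)
        if kegg and chebi:
            mapping[kegg.split(":", 1)[1]].append(chebi.split(":", 1)[1])
    return dict(mapping)


sample = "cpd:C00001\tchebi:15377\ncpd:C00031\tchebi:4167\n"  # illustrative lines
print(parse_conv(sample))
```

Values are lists because one KEGG compound can map to more than one ChEBI entry.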

SDBS (AIST) covers approximately 34,600 compounds with manually curated spectral data (FT-IR, EI-MS, ¹H NMR, ¹³C NMR). High reliability due to expert curation, but no API and a 50-spectra-per-day download limit.

NITE-CHRIP provides cross-searchable regulatory information for approximately 300,000 substances under Japanese law (CSCL, ISHL, Poisonous and Deleterious Substances Control Act, etc.). Updated every two months. GHS classification lists are partially downloadable; the Excel format is sold separately by CIRS Group.

Nikkaji RDF (NBDC) is the RDF release of the Japanese Chemical Substance Dictionary (Nikkaji). Over 3.6 million compounds, CC BY 4.0 — the only Japanese DB that allows commercial bulk download. SPARQL access required.

Regulatory Databases by Country

Chemical substance inventories and regulatory DBs from various jurisdictions.

DB	Region	Coverage	API	CSV/DL	English
ECHA CHEM	EU	C&L: 350K substances	Not available	✓ (Open Data Portal)	Yes
eChemPortal	OECD	Multi-country	Unconfirmed	—	Yes
EFSA OpenFoodTox	EU	7,880 substances	Yes	✓ (Zenodo)	Yes
NCIS	Korea	Full KECL	Unconfirmed	—	Yes
CCISS	China (3rd-party)	IECSC 47K substances	None	—	Yes
DSL	Canada	28K substances	None	✓ (CSV/XLSX)	Yes
AIIC	Australia	40K substances	None	✓ (spreadsheet)	Yes
NITE-CHRIP	Japan	300K substances	None	Partial	Yes

EU: ECHA CHEM

ECHA CHEM is run by the European Chemicals Agency and is the largest of the regulatory DBs listed here. It covers the C&L Inventory (4,400+ harmonized classifications, 7 million+ industry self-classifications) and REACH registration data (physicochemical, toxicological, and ecotoxicological test results). It was relaunched in January 2024, and the C&L module was integrated in May 2025. Bulk data is available from the ECHA Open Data Portal.

An official REST API is not yet available (announced as a future feature).

eChemPortal (OECD) provides a single search interface across regulatory DBs from ECHA REACH, the US EPA, Japan's NITE-CHRIP, and others. It covers 27,000 REACH-registered substances and over 1.3 million endpoint records.

EFSA OpenFoodTox is a toxicological evaluation dataset for 7,880 food-relevant substances (food additives, pesticides, contaminants, etc.). Accessible via the Open EFSA API and downloadable through Zenodo.

China: CCISS / IECSC

China's chemical inventory is the IECSC (现有化学物质名录), maintained by MEE (Ministry of Ecology and Environment). It covers 47,000 substances, but the official site is Chinese-only.

The practical entry point is CCISS (a free tool by CIRS Group), which provides an English-language interface for searching IECSC by CAS number or English name. It is not official, but update frequency and accuracy are reliable. For anyone who cannot read Chinese, CCISS is effectively the only access point.

Korea: NCIS

NCIS (National Chemical Information System) is the official DB run by the National Institute of Environmental Research (NIER). It covers KECL (Korea Existing Chemicals List) for K-REACH compliance, hazardous chemical lists, and GHS classification data. It is one of the few Asian government chemical DBs with an English interface.

Canada and Australia

Canada's DSL (Domestic Substances List, 28,000 substances) is available as CSV/XLSX from the Open Government Portal.

Australia's AIIC (Australian Inventory of Industrial Chemicals, 40,000 substances), maintained by AICIS, is published in spreadsheet format twice a year. Neither DSL nor AIIC has an API; bulk download is the only programmatic option.
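With bulk download as the only route, the usual pattern is to load the file once and index it by CAS number. The sketch below uses an in-memory CSV so it runs as-is; the column names ("CAS RN", "Substance name") and the rows are made up, so adjust `cas_column` to whatever header the real DSL or AIIC file uses.

```python
import csv
import io


def load_inventory(handle, cas_column):
    """Index an inventory CSV by CAS number; later duplicates overwrite earlier rows."""
    index = {}
    for row in csv.DictReader(handle):
        cas = (row.get(cas_column) or "").strip()
        if cas:
            index[cas] = row
    return index


# Made-up sample standing in for a downloaded inventory file.
sample_csv = io.StringIO(
    "CAS RN,Substance name\n"
    "50-78-2,\"Benzoic acid, 2-(acetyloxy)-\"\n"
    "64-17-5,Ethanol\n"
)
inventory = load_inventory(sample_csv, cas_column="CAS RN")
for cas in ["50-78-2", "7732-18-5"]:
    print(cas, "listed" if cas in inventory else "not found")
```

For a real file, replace `sample_csv` with `open(path, newline="", encoding="utf-8")` and re-run the load after each published update.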

Snowflake Marketplace

There are no free chemical structure DB listings.

Major chemical structure DBs such as PubChem and ChEMBL are not available as Data Shares on the Snowflake Marketplace. Commercial listings like IQVIA (clinical and prescription data) and DrugPatentWatch (patent data) exist, but they do not contain molecular structure data.

Snowflake works as a platform rather than a data source for chemistry:

RDKit can be installed as a Snowpark Python UDF, making fingerprint calculation and similarity search available as SQL
ChEMBL and PubChem data are loaded via FTP download and Snowpark ingestion as the standard workflow
AWS terminated its S3 hosting of ChEMBL, so FTP is now the standard retrieval method

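# RDKit via Snowpark Python UDF
# Assumes an active Snowpark session and that the rdkit package is available to your account
from typing import Optional

from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import StringType

@udf(return_type=StringType(), input_types=[StringType()], packages=["rdkit"])
def canonical_smiles(smiles: str) -> Optional[str]:
    from rdkit import Chem  # imported inside the UDF so it resolves on the server
    mol = Chem.MolFromSmiles(smiles) if smiles else None
    # Return NULL for missing or unparseable input
    return Chem.MolToSmiles(mol) if mol else None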
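The similarity search mentioned in the list above usually means ranking by Tanimoto coefficient over fingerprint bits. RDKit ships optimized routines for this; the pure-Python sketch below only shows the arithmetic, with made-up bit sets standing in for real fingerprints and hypothetical function names.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_by_similarity(query_bits, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up on-bit positions; real ones would come from a fingerprint function.
library = {"mol_a": {1, 4, 9, 12}, "mol_b": {1, 4, 7}, "mol_c": {20, 21}}
ranked = rank_by_similarity({1, 4, 9}, library)
for name, score in ranked:
    print(name, round(score, 2))
```

Inside a warehouse, the same ranking would typically be a UDF applied per row followed by an ORDER BY on the score.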

As of May 2026, there is no indication that PubChem or ChEMBL are participating as Snowflake Data Shares.

CAS Number Search

CAS numbers (e.g., 50-78-2 for aspirin) are the practical cross-DB identifier. API support varies by database.

DB	CAS Search	Method
PubChem	✓	REST API (treated as compound name)
ChEMBL	✓	Python SDK (synonym filter)
KEGG	✓	REST API
UniChem	✓	POST API (cross-reference)
ECHA CHEM	✓	Web UI only (no API)
NITE-CHRIP	✓	Web UI only
NCIS (Korea)	✓	Web UI only
CCISS (China)	✓	Web UI only
DSL (Canada)	✓	Web UI + CSV download
AIIC (Australia)	✓	Web UI + spreadsheet
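Whichever database receives the query, CAS numbers carry a check digit that can be verified locally first, which catches most transcription errors before they reach an API. The sketch below implements the standard check: the digits before the final hyphen are weighted 1, 2, 3, ... from right to left, and the sum modulo 10 must equal the last digit. The function name is hypothetical.

```python
import re

CAS_RE = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def is_valid_cas(cas):
    """True if the string is shaped like a CAS number and its check digit matches."""
    match = CAS_RE.fullmatch(cas.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


for cas in ["50-78-2", "64-17-5", "7732-18-5", "50-78-3"]:
    print(cas, is_valid_cas(cas))
```

For 50-78-2 the sum is 8×1 + 7×2 + 0×3 + 5×4 = 42, and 42 mod 10 gives the check digit 2. A valid check digit does not guarantee that the number is assigned to any substance.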

PubChem

CAS numbers can be passed directly as compound names.

# CAS number → CID
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/cids/JSON"

# CAS number → SMILES, molecular weight, and formula in one request
curl "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/property/MolecularFormula,MolecularWeight,IsomericSMILES/JSON"

ChEMBL

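CAS numbers, where ChEMBL records them, are stored among the molecule synonyms. An empty result means the number is not recorded in ChEMBL, not that the compound is absent.

from chembl_webresource_client.new_client import new_client

molecule = new_client.molecule
# Filter on the synonym text inside molecule_synonyms (case-insensitive exact match)
results = list(molecule.filter(molecule_synonyms__molecule_synonym__iexact='50-78-2'))
if results:
    m = results[0]
    print(m['molecule_chembl_id'])
    # molecule_structures can be missing for entries without a defined structure
    structures = m.get('molecule_structures') or {}
    print(structures.get('canonical_smiles'))
else:
    print('No ChEMBL molecule lists this CAS number as a synonym')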

KEGG

The KEGG REST API accepts CAS numbers directly in the find endpoint.

# Search KEGG compounds by CAS number
curl "https://rest.kegg.jp/find/compound/50-78-2"

# Returns TSV like: C01405\t50-78-2
# Then fetch details by C number
curl "https://rest.kegg.jp/get/C01405"

The find endpoint also accepts compound names and, with the formula, exact_mass, or mol_weight option appended to the URL, molecular formulas and molecular weight ranges.

Bulk CAS → SMILES via PubChem

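For pipeline use, note that the PUG REST name namespace is documented as taking a single identifier per request, so batch conversion means looping over the list while staying under the rate limit. For very large lists, PUG Download or FTP is the better fit.

import time

import requests

cas_list = ['50-78-2', '64-17-5', '7732-18-5']  # aspirin, ethanol, water
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/property/IsomericSMILES/JSON"

for cas in cas_list:
    # One name per request; POST keeps unusual characters out of the URL
    resp = requests.post(url, data={'name': cas}, timeout=30)
    if resp.ok:
        for prop in resp.json().get('PropertyTable', {}).get('Properties', []):
            # The response key may be 'SMILES' or 'IsomericSMILES' depending on the API version
            print(cas, prop['CID'], prop.get('SMILES') or prop.get('IsomericSMILES'))
    else:
        print(cas, 'lookup failed with HTTP', resp.status_code)
    time.sleep(0.25)  # stay under 5 requests/sec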

PubChem treats CAS numbers as chemical names. One CAS number can map to multiple CIDs (salts, hydrates, stereoisomers) — if you get multiple results, take the first CID or filter by InChIKey.
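The selection rule in the previous paragraph can be written down as a small function. This is a sketch under stated assumptions: the records are dictionaries with `CID` and `InChIKey` keys, as returned when `InChIKey` is included in the requested property list, and `pick_cid` is a hypothetical helper name. The second sample record is made up.

```python
def pick_cid(properties, expected_inchikey=None):
    """Choose one CID from PubChem property records for a single CAS number.

    With an expected InChIKey, return the exact match or None.
    Without one, fall back to the first record.
    """
    if not properties:
        return None
    if expected_inchikey:
        for prop in properties:
            if prop.get("InChIKey") == expected_inchikey:
                return prop["CID"]
        return None  # nothing matched; do not guess
    return properties[0]["CID"]


records = [
    {"CID": 999999, "InChIKey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N"},  # made-up record
    {"CID": 2244, "InChIKey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},    # aspirin
]
print(pick_cid(records, "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
print(pick_cid(records))
```

Returning None on a failed InChIKey match, rather than silently taking the first CID, keeps salt or hydrate records from slipping into a dataset unnoticed.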

Data Quality

A 2025 EPA paper flagged the propagation of incorrect CAS numbers and stereochemical information across multiple databases. PubChem aggregates from many sources, so conflicting values between them are common. For publications or regulatory submissions, check the primary source — literature or experimental data.
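One inexpensive screen for this kind of propagated error is to collect the structure each source reports for a CAS number and flag any disagreement for manual review. The sketch below assumes records have already been gathered into dictionaries with `cas`, `inchikey`, and `source` keys; those key names, the function name, and the sample data are all made up for illustration.

```python
from collections import defaultdict


def find_conflicts(records):
    """Return {CAS: {InChIKey: [sources]}} for CAS numbers with more than one InChIKey."""
    by_cas = defaultdict(lambda: defaultdict(list))
    for rec in records:
        by_cas[rec["cas"]][rec["inchikey"]].append(rec["source"])
    return {
        cas: {key: sorted(sources) for key, sources in keys.items()}
        for cas, keys in by_cas.items()
        if len(keys) > 1
    }


# Made-up records: two sources agree on one CAS number and disagree on another.
records = [
    {"cas": "50-78-2", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "db_one"},
    {"cas": "50-78-2", "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "source": "db_two"},
    {"cas": "00-00-0", "inchikey": "AAAAAAAAAAAAAA-UHFFFAOYSA-N", "source": "db_one"},
    {"cas": "00-00-0", "inchikey": "BBBBBBBBBBBBBB-UHFFFAOYSA-N", "source": "db_two"},
]
conflicts = find_conflicts(records)
print(conflicts)
```

A flagged entry only says the sources disagree; deciding which one is right still requires going back to the primary source.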

DEV Community