Idea
While reproducing top solutions of a chemistry data competition, I started building a DuckDB community extension for handling chemistry data directly in SQL.
What it can do
- Parse SMILES, InChI, PDB and other chemistry formats directly — no pandas, no RDKit on the side
- Plug into DuckDB's native CSV/Parquet/Iceberg/S3/HTTP readers, so ingestion + light preprocessing happens in one query
Background
What is chemistry data, anyway? One of the canonical forms is SMILES, a notation that encodes a molecular structure as a plain string.
The standard library for reading and processing SMILES is RDKit. For example, ethanol is CCO in SMILES, and RDKit will give you the molecular formula C2H6O from it.
RDKit doesn't stop there — molecular weight, fingerprints, descriptors, substructure search, and so on. It's basically the must-have library for chemistry data work. (Internally, the algorithms come from a bunch of different papers...)
So you might say: do we even need a new extension? In practice though, I kept hitting these:
- Doesn't always work cleanly on Google Colab (often needs a runtime restart)
- pandas version conflicts
- To export results, you usually have to bounce through pandas anyway, or pull in yet another tool depending on the feature you want
What I built
So, here's a DuckDB extension that addresses these.
ducksmiles
https://duckdb.org/community_extensions/extensions/ducksmiles
I implemented a mix of features RDKit covers and a few that are awkward with RDKit alone.
- Compute molecular formula / weight from a SMILES string and write back to CSV → The most basic chemistry preprocessing flow, in a single SQL line
- Decompose an InChI string into its formula / connection / hydrogen / charge / stereo layers, and compare two InChIs at the skeleton level → With RDKit you have to round-trip through a Mol object; ducksmiles pulls the layers straight from the string
- Convert SMILES ⇄ SELFIES (a notation that's easier to use for ML models)
→ Normally you'd need a separate
selfiespackage; this is built in - Aggregate atom / chain / residue counts from PDB / CIF / XYZ (3D structure files) with SQL → Get metadata from heavy structure files without ever opening them in code
- Return
NULLinstead of throwing on invalid SMILES or structures → A million-row batch won't blow up because of one malformed row
Try it out
I'll use two public datasets — DeepChem ESOL CSV and RCSB PDB — and paste the actual CLI output for each of the five capabilities above.
duckdb
INSTALL ducksmiles FROM community;
LOAD ducksmiles;
1. SMILES → formula / weight
SELECT smiles, mol_formula(smiles) AS formula, mol_weight(smiles) AS weight
FROM read_csv_auto('https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv')
LIMIT 5;
┌────────────────────────────────────────────────────────┬───────────┬─────────┐
│ smiles │ formula │ weight │
├────────────────────────────────────────────────────────┼───────────┼─────────┤
│ OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O │ NULL │ NULL │
│ Cc1occc1C(=O)Nc2ccccc2 │ C12H11NO2 │ 201.225 │
│ CC(C)=CCCC(C)=CC(=O) │ C10H16O │ 152.237 │
│ c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43 │ C22H14 │ 278.354 │
│ c1ccsc1 │ C4H4S │ 84.136 │
└────────────────────────────────────────────────────────┴───────────┴─────────┘
The public CSV URL goes straight into read_csv_auto, and you get formulas and weights right there. The first row's complex SMILES returns NULL rather than throwing.
2. Decompose an InChI into its layers
SELECT
name,
inchi_formula(inchi) AS formula,
inchi_connections(inchi) AS connections,
inchi_hydrogens(inchi) AS hydrogens
FROM (VALUES
('Ethanol', 'InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3'),
('Caffeine', 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3')
) AS t(name, inchi);
┌──────────┬───────────┬───────────────────────────────────────┬────────────┐
│ name │ formula │ connections │ hydrogens │
├──────────┼───────────┼───────────────────────────────────────┼────────────┤
│ Ethanol │ C2H6O │ 1-2-3 │ 3H,2H2,1H3 │
│ Caffeine │ C8H10N4O2 │ 1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2 │ 4H,1-3H3 │
└──────────┴───────────┴───────────────────────────────────────┴────────────┘
InChI is a notation that bundles atom connectivity, hydrogen positions, and other facets into a single string, split across labeled sections. For example, ethanol looks like this:
InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
└─────┘ └────┘ └────────┘
formula conn. hydrogens
inchi_formula inchi_connections inchi_hydrogens
Normally you'd round-trip through RDKit's Mol object to extract pieces. With ducksmiles, you pull the section you want straight from the string. The query result above (C2H6O / 1-2-3 / 3H,2H2,1H3) is exactly the substring of those layers in the original InChI.
3. SMILES ⇄ SELFIES roundtrip
Pulling some short SMILES out of ESOL and running SMILES → SELFIES → SMILES:
SELECT
smiles,
smiles_to_selfies(smiles) AS selfies,
selfies_to_smiles(smiles_to_selfies(smiles)) AS roundtrip
FROM read_csv_auto('https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv')
WHERE LENGTH(smiles) < 12
LIMIT 5;
┌───────────┬─────────────────────────────────────────┬───────────┐
│ smiles │ selfies │ roundtrip │
├───────────┼─────────────────────────────────────────┼───────────┤
│ c1ccsc1 │ [c][Ring1][C][c][c][s][c][Ring1][C] │ cccsc │
│ O=C1CCCN1 │ [O][=C][Ring1][C][C][C][C][N][Ring1][C] │ O=CCCCN │
│ CCCC=C │ [C][C][C][C][=C] │ CCCC=C │
│ CC(C)Cl │ [C][C][Branch1][C][C][Cl] │ CC(C)Cl │
│ CCC(C)CO │ [C][C][C][Branch1][C][C][C][O] │ CCC(C)CO │
└───────────┴─────────────────────────────────────────┴───────────┘
Both directions of the conversion run inside DuckDB. Normally you'd need a separate Python selfies package on the side.
4. Metadata from PDB structure files
read_text accepts HTTPS URLs and arrays directly, so you can pull three PDB files straight from RCSB and aggregate them in one go:
INSTALL httpfs; LOAD httpfs;
SELECT
filename,
structure_atom_count(content) AS atoms,
structure_chain_count(content) AS chains,
structure_residue_count(content) AS residues,
structure_model_count(content) AS models
FROM read_text([
'https://files.rcsb.org/download/1AKE.pdb',
'https://files.rcsb.org/download/1CRN.pdb',
'https://files.rcsb.org/download/4HHB.pdb'
]);
┌──────────────────────────────────────────┬───────┬────────┬──────────┬────────┐
│ filename │ atoms │ chains │ residues │ models │
├──────────────────────────────────────────┼───────┼────────┼──────────┼────────┤
│ https://files.rcsb.org/download/1AKE.pdb │ 3816 │ 2 │ 428 │ 1 │
│ https://files.rcsb.org/download/1CRN.pdb │ 327 │ 1 │ 46 │ 1 │
│ https://files.rcsb.org/download/4HHB.pdb │ 4779 │ 4 │ 574 │ 1 │
└──────────────────────────────────────────┴───────┴────────┴──────────┴────────┘
read_text returns the file content as a content column, so you just feed it to the structure_* functions. You also get filename for free, which makes it easy to keep track of the original file when you scale this to thousands.
5. Bad input returns NULL instead of crashing
SELECT
smiles,
mol_is_valid(smiles) AS valid,
mol_formula(smiles) AS formula,
mol_weight(smiles) AS weight
FROM (VALUES
('CCO'),
('not_a_smiles'),
('c1ccccc1'),
(''),
('CC(=O)O')
) AS t(smiles);
┌──────────────┬───────┬─────────┬────────┐
│ smiles │ valid │ formula │ weight │
├──────────────┼───────┼─────────┼────────┤
│ CCO │ true │ C2H6O │ 46.069 │
│ not_a_smiles │ false │ NULL │ NULL │
│ c1ccccc1 │ true │ C6H6 │ 78.114 │
│ │ false │ NULL │ NULL │
│ CC(=O)O │ true │ C2H4O2 │ 60.052 │
└──────────────┴───────┴─────────┴────────┘
Even with malformed SMILES or empty strings mixed in, those rows just become NULL — the query keeps going. It's built for batch.
Closing thoughts
One thing that struck me while working with chemistry data is just how many formats there are.
SMILES, InChI, SDF... each has its own format, and each tends to want its own dedicated tool.
On top of that, RDKit itself has a huge surface — its "descriptors" alone come in around 200 varieties, each backed by different papers.
What I want ducksmiles to grow into is a single tool that absorbs those plumbing and preprocessing concerns, so you can move faster without juggling toolchains. There's a lot to learn from the prior art, and I plan to chip away at it over time.
Also — DuckDB itself is a general-purpose engine, so it can't cover every domain's specific needs. That's exactly what community extensions are for, and the bar to author one is surprisingly low. If you have a domain pain point that DuckDB doesn't quite cover, please give it a try.
Top comments (0)