nk_Enuke

Posted on May 2

I built a DuckDB extension to handle chemistry data without pandas or RDKit

#rust #duckdb #chemistry #dataengineering

Idea

While reproducing top solutions of a chemistry data competition, I started building a DuckDB community extension for handling chemistry data directly in SQL.

What it can do

Parse SMILES, InChI, PDB and other chemistry formats directly — no pandas, no RDKit on the side
Plug into DuckDB's native CSV/Parquet/Iceberg/S3/HTTP readers, so ingestion + light preprocessing happens in one query

Background

What is chemistry data, anyway? One of the canonical forms is SMILES, a notation that encodes a molecular structure as a plain string.

The standard library for reading and processing SMILES is RDKit. For example, ethanol is CCO in SMILES, and RDKit will give you the molecular formula C2H6O from it.

RDKit doesn't stop there — molecular weight, fingerprints, descriptors, substructure search, and so on. It's basically the must-have library for chemistry data work. (Internally, the algorithms come from a bunch of different papers...)

So you might say: do we even need a new extension? In practice though, I kept hitting these:

Doesn't always work cleanly on Google Colab (often needs a runtime restart)
pandas version conflicts
To export results, you usually have to bounce through pandas anyway, or pull in yet another tool depending on the feature you want

What I built

So, here's a DuckDB extension that addresses these.

ducksmiles

https://duckdb.org/community_extensions/extensions/ducksmiles

I implemented a mix of features RDKit covers and a few that are awkward with RDKit alone.

Compute molecular formula / weight from a SMILES string and write back to CSV → The most basic chemistry preprocessing flow, in a single SQL line
Decompose an InChI string into its formula / connection / hydrogen / charge / stereo layers, and compare two InChIs at the skeleton level → With RDKit you have to round-trip through a Mol object; ducksmiles pulls the layers straight from the string
Convert SMILES ⇄ SELFIES (a notation that's easier to use for ML models) → Normally you'd need a separate selfies package; this is built in
Aggregate atom / chain / residue counts from PDB / CIF / XYZ (3D structure files) with SQL → Get metadata from heavy structure files without ever opening them in code
Return NULL instead of throwing on invalid SMILES or structures → A million-row batch won't blow up because of one malformed row

Try it out

I'll use two public datasets — DeepChem ESOL CSV and RCSB PDB — and paste the actual CLI output for each of the five capabilities above.

duckdb

INSTALL ducksmiles FROM community;
LOAD ducksmiles;

1. SMILES → formula / weight

SELECT smiles, mol_formula(smiles) AS formula, mol_weight(smiles) AS weight
FROM read_csv_auto('https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv')
LIMIT 5;

┌────────────────────────────────────────────────────────┬───────────┬─────────┐
│                         smiles                         │  formula  │ weight  │
├────────────────────────────────────────────────────────┼───────────┼─────────┤
│ OCC3OC(OCC2OC(OC(C#N)c1ccccc1)C(O)C(O)C2O)C(O)C(O)C3O  │ NULL      │ NULL    │
│ Cc1occc1C(=O)Nc2ccccc2                                 │ C12H11NO2 │ 201.225 │
│ CC(C)=CCCC(C)=CC(=O)                                   │ C10H16O   │ 152.237 │
│ c1ccc2c(c1)ccc3c2ccc4c5ccccc5ccc43                     │ C22H14    │ 278.354 │
│ c1ccsc1                                                │ C4H4S     │ 84.136  │
└────────────────────────────────────────────────────────┴───────────┴─────────┘

The public CSV URL goes straight into read_csv_auto, and you get formulas and weights right there. The first row's complex SMILES returns NULL rather than throwing.

2. Decompose an InChI into its layers

SELECT
  name,
  inchi_formula(inchi) AS formula,
  inchi_connections(inchi) AS connections,
  inchi_hydrogens(inchi) AS hydrogens
FROM (VALUES
  ('Ethanol',  'InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3'),
  ('Caffeine', 'InChI=1S/C8H10N4O2/c1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2/h4H,1-3H3')
) AS t(name, inchi);

┌──────────┬───────────┬───────────────────────────────────────┬────────────┐
│   name   │  formula  │              connections              │ hydrogens  │
├──────────┼───────────┼───────────────────────────────────────┼────────────┤
│ Ethanol  │ C2H6O     │ 1-2-3                                 │ 3H,2H2,1H3 │
│ Caffeine │ C8H10N4O2 │ 1-10-4-9-6-5(10)7(13)12(3)8(14)11(6)2 │ 4H,1-3H3   │
└──────────┴───────────┴───────────────────────────────────────┴────────────┘

InChI is a notation that bundles atom connectivity, hydrogen positions, and other facets into a single string, split across labeled sections. For example, ethanol looks like this:

InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3
        └─────┘ └────┘ └────────┘
        formula  conn.    hydrogens
       inchi_formula  inchi_connections  inchi_hydrogens

Normally you'd round-trip through RDKit's Mol object to extract pieces. With ducksmiles, you pull the section you want straight from the string. The query result above (C2H6O / 1-2-3 / 3H,2H2,1H3) is exactly the substring of those layers in the original InChI.

3. SMILES ⇄ SELFIES roundtrip

Pulling some short SMILES out of ESOL and running SMILES → SELFIES → SMILES:

SELECT
  smiles,
  smiles_to_selfies(smiles) AS selfies,
  selfies_to_smiles(smiles_to_selfies(smiles)) AS roundtrip
FROM read_csv_auto('https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/delaney-processed.csv')
WHERE LENGTH(smiles) < 12
LIMIT 5;

┌───────────┬─────────────────────────────────────────┬───────────┐
│  smiles   │                 selfies                 │ roundtrip │
├───────────┼─────────────────────────────────────────┼───────────┤
│ c1ccsc1   │ [c][Ring1][C][c][c][s][c][Ring1][C]     │ cccsc     │
│ O=C1CCCN1 │ [O][=C][Ring1][C][C][C][C][N][Ring1][C] │ O=CCCCN   │
│ CCCC=C    │ [C][C][C][C][=C]                        │ CCCC=C    │
│ CC(C)Cl   │ [C][C][Branch1][C][C][Cl]               │ CC(C)Cl   │
│ CCC(C)CO  │ [C][C][C][Branch1][C][C][C][O]          │ CCC(C)CO  │
└───────────┴─────────────────────────────────────────┴───────────┘

Both directions of the conversion run inside DuckDB. Normally you'd need a separate Python selfies package on the side.

4. Metadata from PDB structure files

read_text accepts HTTPS URLs and arrays directly, so you can pull three PDB files straight from RCSB and aggregate them in one go:

INSTALL httpfs; LOAD httpfs;

SELECT
  filename,
  structure_atom_count(content)    AS atoms,
  structure_chain_count(content)   AS chains,
  structure_residue_count(content) AS residues,
  structure_model_count(content)   AS models
FROM read_text([
  'https://files.rcsb.org/download/1AKE.pdb',
  'https://files.rcsb.org/download/1CRN.pdb',
  'https://files.rcsb.org/download/4HHB.pdb'
]);

┌──────────────────────────────────────────┬───────┬────────┬──────────┬────────┐
│                 filename                 │ atoms │ chains │ residues │ models │
├──────────────────────────────────────────┼───────┼────────┼──────────┼────────┤
│ https://files.rcsb.org/download/1AKE.pdb │ 3816  │ 2      │ 428      │ 1      │
│ https://files.rcsb.org/download/1CRN.pdb │ 327   │ 1      │ 46       │ 1      │
│ https://files.rcsb.org/download/4HHB.pdb │ 4779  │ 4      │ 574      │ 1      │
└──────────────────────────────────────────┴───────┴────────┴──────────┴────────┘

read_text returns the file content as a content column, so you just feed it to the structure_* functions. You also get filename for free, which makes it easy to keep track of the original file when you scale this to thousands.

5. Bad input returns NULL instead of crashing

SELECT
  smiles,
  mol_is_valid(smiles) AS valid,
  mol_formula(smiles) AS formula,
  mol_weight(smiles) AS weight
FROM (VALUES
  ('CCO'),
  ('not_a_smiles'),
  ('c1ccccc1'),
  (''),
  ('CC(=O)O')
) AS t(smiles);

┌──────────────┬───────┬─────────┬────────┐
│    smiles    │ valid │ formula │ weight │
├──────────────┼───────┼─────────┼────────┤
│ CCO          │ true  │ C2H6O   │ 46.069 │
│ not_a_smiles │ false │ NULL    │ NULL   │
│ c1ccccc1     │ true  │ C6H6    │ 78.114 │
│              │ false │ NULL    │ NULL   │
│ CC(=O)O      │ true  │ C2H4O2  │ 60.052 │
└──────────────┴───────┴─────────┴────────┘

Even with malformed SMILES or empty strings mixed in, those rows just become NULL — the query keeps going. It's built for batch.

Closing thoughts

One thing that struck me while working with chemistry data is just how many formats there are.

SMILES, InChI, SDF... each has its own format, and each tends to want its own dedicated tool.

On top of that, RDKit itself has a huge surface — its "descriptors" alone come in around 200 varieties, each backed by different papers.

What I want ducksmiles to grow into is a single tool that absorbs those plumbing and preprocessing concerns, so you can move faster without juggling toolchains. There's a lot to learn from the prior art, and I plan to chip away at it over time.

Also — DuckDB itself is a general-purpose engine, so it can't cover every domain's specific needs. That's exactly what community extensions are for, and the bar to author one is surprisingly low. If you have a domain pain point that DuckDB doesn't quite cover, please give it a try.

DEV Community