kent-tokyo

Posted on Jun 7

Building chematic: A Pure-Rust Cheminformatics Library for WebAssembly

#rust #webassembly #opensource #chemistry

Many cheminformatics tools, including RDKit and Open Babel, are written in C or C++. When they are used from JavaScript, they are often compiled through Emscripten or exposed through generated bindings. That works, but it usually brings a larger native toolchain into the build and makes package size a constant concern.

What happens if you remove all C and C++ from that stack?

That is the idea behind chematic. I wrote a cheminformatics library from scratch in pure Rust and published it as a WASM-backed npm package. The optimized core WASM artifact is about 550 KB in my release build. The Rust side is Cargo-based; npm packaging still uses the usual WASM binding and optimization steps.

Why Pure Rust?

Cheminformatics algorithms are complicated. Ring perception in molecular graphs, stereochemistry assignment, fingerprint generation: all of these rely heavily on careful graph manipulation. Existing C++ libraries are powerful and deeply validated, but I wanted this project to keep the memory-safety boundary auditable across the whole stack.

Rust was attractive for exactly that reason: in safe Rust, that boundary is checked by the compiler rather than by convention.

Native Fit for WASM

Rust's WASM support is integrated into the standard toolchain. The wasm32-unknown-unknown target is installed with rustup target add wasm32-unknown-unknown, and after that cargo build --target wasm32-unknown-unknown is enough to produce a WASM artifact.

For the Rust crates themselves, there is no need for OS-specific native dependencies or C/C++ build scripts. The JavaScript package still has a packaging step, but the chemistry implementation does not depend on an external native library.

Memory Safety as a Constraint

chematic uses no unsafe Rust. That is not just a policy. It is a design constraint.

When working with complex data structures such as molecular graphs, I wanted the implementation to stay inside Rust's normal safety model. I relied on the ownership system and borrow checker, and implemented all operations inside safe Rust.
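One common way to keep a graph inside safe Rust is to store atoms and bonds in flat vectors and refer to them by index instead of by pointer. The sketch below illustrates that idea only; the `Atom`, `Bond`, and `Molecule` types are hypothetical and are not chematic's actual API. A crate can additionally enforce the rule at compile time with the `#![forbid(unsafe_code)]` attribute at the crate root; whether chematic does so is an assumption here.

```rust
/// Hypothetical atom type for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub struct Atom {
    pub atomic_number: u8,
}

/// Hypothetical bond type: endpoints are indices into the atom vector.
#[derive(Debug, Clone, PartialEq)]
pub struct Bond {
    pub begin: usize,
    pub end: usize,
    pub order: u8,
}

/// Index-based molecular graph: no raw pointers, no reference cycles.
#[derive(Debug, Default)]
pub struct Molecule {
    atoms: Vec<Atom>,
    bonds: Vec<Bond>,
}

impl Molecule {
    /// Adds an atom and returns its index.
    pub fn add_atom(&mut self, atomic_number: u8) -> usize {
        self.atoms.push(Atom { atomic_number });
        self.atoms.len() - 1
    }

    /// Adds a bond, rejecting out-of-range indices and self-loops.
    pub fn add_bond(&mut self, begin: usize, end: usize, order: u8) -> Result<usize, String> {
        let n = self.atoms.len();
        if begin >= n || end >= n {
            return Err(format!("atom index out of range (molecule has {n} atoms)"));
        }
        if begin == end {
            return Err("a bond cannot connect an atom to itself".to_string());
        }
        self.bonds.push(Bond { begin, end, order });
        Ok(self.bonds.len() - 1)
    }

    /// Indices of the atoms bonded to `atom`, in bond insertion order.
    pub fn neighbors(&self, atom: usize) -> Vec<usize> {
        self.bonds
            .iter()
            .filter_map(|b| {
                if b.begin == atom {
                    Some(b.end)
                } else if b.end == atom {
                    Some(b.begin)
                } else {
                    None
                }
            })
            .collect()
    }
}
```

Because invalid indices are rejected when a bond is created, later traversals can index the vectors without extra checks beyond the bounds checks Rust already performs.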

Reproducibility and Determinism

For fingerprint calculation, the same molecule must always produce the same bit vector. chematic uses fixed atom-environment ordering, stable serialization, and fixed-width FNV-1a hashing, and it avoids randomized hashers.

The goal is deterministic fingerprints across platforms and runtime environments, which makes similarity scores reproducible and easier to test.
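For readers unfamiliar with FNV-1a, here is a minimal sketch of the 64-bit variant together with one way to serialize an atom environment before hashing. The constants are the standard FNV-1a 64-bit parameters. The `hash_environment` helper and its field layout are hypothetical; the article does not say which width or serialization chematic uses.

```rust
/// 64-bit FNV-1a over a byte slice. The result depends only on the bytes,
/// not on the platform, process, or a random seed.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &b in bytes {
        hash ^= u64::from(b); // FNV-1a: XOR first, then multiply
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical helper: hashes an atom environment in a stable way.
/// Neighbor identifiers are sorted so that input order does not matter,
/// and integers are written as little-endian bytes on every platform.
pub fn hash_environment(atomic_number: u8, degree: u8, neighbor_ids: &[u64]) -> u64 {
    let mut sorted = neighbor_ids.to_vec();
    sorted.sort_unstable();
    let mut buf = vec![atomic_number, degree];
    for id in sorted {
        buf.extend_from_slice(&id.to_le_bytes());
    }
    fnv1a_64(&buf)
}
```

By contrast, Rust's default `HashMap` hasher is randomly seeded per process, so its hash values are not suitable as fingerprint bits.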

Turning Zero FFI from a Choice into a Rule

FFI to C or C++ libraries is useful, but once it is allowed, exceptions start appearing:

"InChI needs a C library."
"This optimization should be written in C."
"This one dependency is too convenient to avoid."

chematic forbids FFI itself. rdkit-sys, openbabel-sys, cc, bindgen: none of them are used. That constraint keeps the implementation coherent and makes the dependency boundary easier to reason about.

The Shape of `chematic`

The project is organized into 13 Rust crates:

chematic/
├── chematic-core       -> basic atom, bond, and molecule types; kekulization; zero dependencies
├── chematic-smiles     -> OpenSMILES parser and canonical SMILES writer
├── chematic-perception -> ring perception (SSSR) and aromaticity detection
├── chematic-mol        -> MOL/SDF file format reading and writing
├── chematic-depict     -> 2D SVG depiction with CPK colors and highlights
├── chematic-chem       -> 40+ molecular descriptors: MW, LogP, TPSA, and more
├── chematic-fp         -> 7 fingerprint types, including ECFP and MACCS
├── chematic-smarts     -> SMARTS parser and substructure search using VF2
├── chematic-3d         -> experimental 3D coordinate generation and force-field minimization
├── chematic-rxn        -> reaction SMILES/SMIRKS parser
├── chematic-wasm       -> WASM bindings for JavaScript and TypeScript
├── chematic-iupac      -> IUPAC name generation in pure Rust, without network access
└── chematic            -> umbrella crate integrating the full set

At the moment, all 933 tests pass. I also ran a ChEMBL 37 validation pass over 2,897,819 molecule records. In that pass, the checked SMILES inputs parsed successfully and satisfied the validation checks used by the test harness.

Implementing the Algorithms

Here are some of the core chemistry-specific algorithms and how they were implemented.

Ring Perception (SSSR)

Aromaticity, ring counts, and many other molecular properties depend on ring structures, from the single ring of benzene to fused systems such as naphthalene. Detecting those rings from a molecular graph is the job of SSSR: the Smallest Set of Smallest Rings.

Generic graph algorithms, for example ones built on a library such as petgraph, can compute a cycle basis, but chemistry has stricter definitions of what counts as a ring. Those definitions do not always match the generic graph-theory answer.

chematic implements an algorithm specialized for chemical requirements. SSSR is not unique for every graph, so the target is not "the one true ring set." The practical goal is stable behavior that matches the selected chemistry test cases and RDKit-style expectations used by the project.
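Although the choice of rings is not unique, the number of rings in any SSSR is fixed: it equals bonds minus atoms plus connected components. That count is a useful sanity check for any ring-perception code. The function below is a minimal sketch of that check, not chematic's ring-perception algorithm; it assumes a simple graph with no duplicate bonds and valid atom indices.

```rust
/// Number of rings in any SSSR of a molecular graph:
/// bonds - atoms + connected components.
/// Panics if a bond refers to an atom index >= `atom_count`.
pub fn sssr_ring_count(atom_count: usize, bonds: &[(usize, usize)]) -> usize {
    // Union-find with path halving, used only to count components.
    fn find(parent: &mut [usize], mut x: usize) -> usize {
        while parent[x] != x {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        x
    }

    let mut parent: Vec<usize> = (0..atom_count).collect();
    let mut components = atom_count;
    for &(a, b) in bonds {
        let ra = find(&mut parent, a);
        let rb = find(&mut parent, b);
        if ra != rb {
            parent[ra] = rb; // merge two components
            components -= 1;
        }
    }
    bonds.len() + components - atom_count
}
```

For benzene (6 atoms, 6 bonds, 1 component) this gives 1, and for naphthalene (10 atoms, 11 bonds) it gives 2.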

Kekulization

Aromatic molecules such as benzene are often written in SMILES with lowercase aromatic atoms and no explicit bond orders. Converting that notation into a concrete structure means assigning alternating single and double bonds, a process called kekulization. For many common aromatic systems, the implementation can be formulated as a matching problem over the molecular graph, with explicit failure handling for cases that cannot be assigned consistently.

// Input:  SMILES "c1ccccc1" (benzene, aromatic notation)
// Output: "C1=CC=CC=C1"   (Kekule form with alternating single and double bonds)

Rust's type system helps prevent invalid kekulization states from leaking through the API. It does not make chemistry errors impossible, but it does make illegal intermediate states harder to represent accidentally.
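The matching formulation can be sketched with a small backtracking search: every aromatic atom that needs a double bond must be paired with exactly one aromatic neighbor. This is an illustration under simplifying assumptions, not chematic's implementation. It assumes every atom passed in needs exactly one double bond (true for the carbons of benzene, not for a pyrrole-type nitrogen), and backtracking is exponential in the worst case, so a production version would likely use an augmenting-path matching algorithm.

```rust
/// Tries to pair every atom with one neighbor along an aromatic bond.
/// Returns the bonds that become double bonds, or None if no consistent
/// assignment exists. Assumes valid indices and no self-loops.
pub fn kekulize(atom_count: usize, aromatic_bonds: &[(usize, usize)]) -> Option<Vec<(usize, usize)>> {
    let mut adj = vec![Vec::new(); atom_count];
    for &(a, b) in aromatic_bonds {
        adj[a].push(b);
        adj[b].push(a);
    }
    let mut partner: Vec<Option<usize>> = vec![None; atom_count];
    if !assign(&adj, &mut partner) {
        return None;
    }
    // Report each matched pair once, lower index first.
    let mut doubles = Vec::new();
    for (a, p) in partner.iter().enumerate() {
        if let Some(b) = *p {
            if a < b {
                doubles.push((a, b));
            }
        }
    }
    Some(doubles)
}

/// Backtracking step: match the first unmatched atom with a free neighbor.
fn assign(adj: &[Vec<usize>], partner: &mut Vec<Option<usize>>) -> bool {
    let a = match partner.iter().position(|p| p.is_none()) {
        Some(a) => a,
        None => return true, // every atom has a double bond
    };
    for &b in &adj[a] {
        if partner[b].is_none() {
            partner[a] = Some(b);
            partner[b] = Some(a);
            if assign(adj, partner) {
                return true;
            }
            // Undo and try the next neighbor.
            partner[a] = None;
            partner[b] = None;
        }
    }
    false
}
```

Returning `Option` makes the failure case explicit: a five-membered ring in which every atom demands a double bond has no perfect matching, so the function returns `None` instead of a partially assigned structure.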

CIP Stereochemistry

CIP rules, the Cahn-Ingold-Prelog rules, determine R/S configuration for chiral centers and E/Z configuration for double bonds. The priority assignment algorithm is complex: each atom receives a priority, that priority can depend on its neighbors, and the effect propagates recursively through the graph.

The current implementation covers the subset needed by the library's stereochemistry tests, including common tetrahedral and double-bond cases. More obscure CIP edge cases, such as advanced pseudoasymmetry handling, should be treated as an area for continued validation rather than a solved claim.
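To make the starting point of the priority assignment concrete, the sketch below applies only the first CIP rule at the first sphere: a substituent atom with a higher atomic number outranks one with a lower atomic number. It is deliberately incomplete and is not chematic's algorithm. When two substituents tie, a real implementation has to explore further along each branch, which is where the recursion described above comes in; this hypothetical helper simply reports the tie as `None`.

```rust
/// Ranks the atoms directly attached to a stereocenter by atomic number,
/// highest priority first. Returns the substituent indices in priority
/// order, or None if any two substituents tie at the first sphere.
pub fn rank_first_sphere(substituents: &[u8]) -> Option<Vec<usize>> {
    let mut order: Vec<usize> = (0..substituents.len()).collect();
    // Sort indices by descending atomic number.
    order.sort_by(|&a, &b| substituents[b].cmp(&substituents[a]));
    // Adjacent equal values mean the first sphere cannot decide.
    for pair in order.windows(2) {
        if substituents[pair[0]] == substituents[pair[1]] {
            return None;
        }
    }
    Some(order)
}
```

For bromochlorofluoromethane, with substituents H, F, Cl, Br given as atomic numbers `[1, 9, 17, 35]`, the result is `[3, 2, 1, 0]`: Br, then Cl, then F, then H.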

ECFP Fingerprints (Morgan Algorithm)

ECFP fingerprints convert the local environment around atoms into hashed identifiers, which are then used to calculate molecular similarity. The algorithm runs for multiple rounds, expanding each atom's neighborhood information and hashing it at every step.

chematic canonicalizes input order before calculation and uses a fixed-width FNV-1a hash. The important property is not the hash function alone, but the combination of stable ordering, stable serialization, and no randomized hashing. As a result, the same molecule should produce the same bit vector every time.
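The round-by-round structure can be sketched as follows. This is a simplified Morgan-style iteration, not chematic's ECFP implementation: the initial invariants use only atomic number and degree, duplicate environments are not removed, and the function name, signature, and byte layout are all hypothetical. It does show the three properties named above: neighbors are sorted before hashing, integers are serialized in a fixed byte order, and the hash is unseeded.

```rust
use std::collections::BTreeSet;

/// 64-bit FNV-1a with the standard parameters.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Simplified Morgan-style fingerprint. Bonds are (atom, atom, order).
/// Returns the set bit positions; `n_bits` must be greater than zero.
pub fn morgan_bits(
    atomic_numbers: &[u8],
    bonds: &[(usize, usize, u8)],
    radius: usize,
    n_bits: usize,
) -> BTreeSet<usize> {
    let n = atomic_numbers.len();
    let mut adj: Vec<Vec<(usize, u8)>> = vec![Vec::new(); n];
    for &(a, b, order) in bonds {
        adj[a].push((b, order));
        adj[b].push((a, order));
    }
    // Round 0: one identifier per atom from (atomic number, degree).
    let mut ids: Vec<u64> = (0..n)
        .map(|i| fnv1a_64(&[atomic_numbers[i], adj[i].len() as u8]))
        .collect();
    let mut features: BTreeSet<u64> = ids.iter().copied().collect();

    for round in 1..=radius {
        let mut next = Vec::with_capacity(n);
        for i in 0..n {
            // Collect (bond order, neighbor id) and sort for order independence.
            let mut env: Vec<(u8, u64)> =
                adj[i].iter().map(|&(j, order)| (order, ids[j])).collect();
            env.sort_unstable();
            let mut buf = Vec::new();
            buf.extend_from_slice(&(round as u64).to_le_bytes());
            buf.extend_from_slice(&ids[i].to_le_bytes());
            for (order, id) in env {
                buf.push(order);
                buf.extend_from_slice(&id.to_le_bytes());
            }
            next.push(fnv1a_64(&buf));
        }
        ids = next;
        features.extend(ids.iter().copied());
    }
    // Fold the 64-bit identifiers into the fixed-length bit vector.
    features.into_iter().map(|f| (f % n_bits as u64) as usize).collect()
}
```

Because identifiers never depend on atom indices, ethanol numbered as C-C-O and as O-C-C yields the same set of bits.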

Scope and Packaging Tradeoffs

This is not a replacement for RDKit. RDKit has decades of chemistry validation and a much broader feature set. The tradeoff in chematic is narrower scope in exchange for a pure-Rust implementation and a smaller WASM-oriented package.

The numbers below are intended as project-level packaging context, not a benchmark. WASM and npm sizes vary depending on build options, exported APIs, JavaScript glue, compression, and optimization settings.

Item	chematic	RDKit.js / Open Babel-style stacks	OCL.js / Indigo-style stacks
Native dependency in the chemistry core	None	Usually yes	Depends on project
WASM/package size profile	Small optimized core artifact in my build	Often larger because mature native libraries expose broad functionality	Usually depends heavily on exported feature set
Build model	Cargo-based Rust crates plus WASM packaging	Native build toolchain plus generated bindings	Project-specific
Feature coverage	Focused subset implemented in Rust	Much broader, especially RDKit	Broader in some areas, different scope in others
InChI/InChIKey	No, out of scope	Often available	Often available

The standard/reference InChI implementation is written in C. I chose not to wrap it because that would violate the no-FFI constraint. chematic documents InChI/InChIKey support as out of scope.

Instead, it covers many use cases through canonical SMILES, Murcko scaffolds, and molecular graph isomorphism.

Use Cases

This library is useful in places where a lightweight browser-side cheminformatics implementation is more important than full RDKit parity:

Browser-side chemical database filtering: fingerprint similarity and descriptor filters that can run directly on an end user's machine.
Prototype screening UIs: Lipinski rules, QED, SA score, and other descriptors calculated from a web interface without a backend round trip.
Teaching and visualization workflows: molecule parsing, depiction, and experimental 3D generation for interactive demos.
Lightweight SAR exploration: extracting simple chemical-change patterns in desktop or browser applications.
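As an example of the kind of filter a screening UI might apply, here is a minimal sketch of Lipinski's rule of five using the commonly cited thresholds: molecular weight at most 500, LogP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors. The `Descriptors` struct is hypothetical and stands in for whatever descriptor values the library returns; the "at most one violation" cutoff is the usual convention, not something the article specifies.

```rust
/// Hypothetical descriptor record; field names are illustrative only.
pub struct Descriptors {
    pub mw: f64,   // molecular weight in g/mol
    pub logp: f64, // octanol-water partition coefficient
    pub hbd: u32,  // hydrogen-bond donors
    pub hba: u32,  // hydrogen-bond acceptors
}

/// Counts how many of the four rule-of-five criteria are violated.
pub fn lipinski_violations(d: &Descriptors) -> u32 {
    let checks = [d.mw > 500.0, d.logp > 5.0, d.hbd > 5, d.hba > 10];
    checks.iter().filter(|&&violated| violated).count() as u32
}

/// Conventional reading of the rule: at most one violation is tolerated.
pub fn passes_lipinski(d: &Descriptors) -> bool {
    lipinski_violations(d) <= 1
}
```

A filter like this needs only a handful of scalar descriptors per molecule, which is why it fits comfortably in a browser without a backend round trip.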

Development Progress

chematic was developed in phases, from Phase 1 for the foundation to Phase 15 for extended functionality. It is now at v0.1.32 and still actively maintained.

Phase 1-6: core functionality: parsing, ring perception, descriptors, fingerprints, and RDKit compatibility
Phase 7-9: extended descriptors and diversity selection: EState, VSA, SA score, MaxMin, and Butina
Phase 10-15: mutable APIs, 2D stereochemistry, reaction formats, and IUPAC naming

At each stage, I repeated compatibility testing against ChEMBL. The ChEMBL 37 validation pass covered 2,897,819 molecule records and checked that the input structures used by the harness could be parsed and processed without failing those validation checks. This should be read as parser and pipeline validation, not as proof that every descriptor or stereochemistry result is scientifically equivalent to RDKit.

Limitations

The main limitation is scope. chematic is designed around a no-FFI pure-Rust constraint, so it intentionally does not expose everything available in mature cheminformatics systems.

InChI and InChIKey are out of scope because wrapping the reference implementation would require FFI.
3D coordinate generation and force-field minimization are useful for demos and exploratory workflows, but they should not be treated as production-grade molecular modeling validation.
CIP stereochemistry support covers the common cases tested by the project, but rare edge cases need more validation.
RDKit compatibility is a testing target for selected behavior, not a claim of full RDKit equivalence.

Current Status and Demo

Live Demo

Descriptor calculation, fingerprint similarity, drug-likeness rules, a 3D viewer, reaction schemes, and more run in the browser through WASM:

https://kent-tokyo.github.io/chematic/

JavaScript and TypeScript Usage

npm install @kent-tokyo/chematic

import init, {
  parse_smiles, get_descriptors_json, brics_fragments_json,
  enumerate_stereo_isomers_json, tanimoto_ecfp4
} from '@kent-tokyo/chematic';

await init();

const mol = parse_smiles('CC(=O)Oc1ccccc1C(=O)O'); // aspirin
console.log(mol.molecular_weight()); // ~180.16
console.log(mol.qed());              // drug-likeness [0,1]

// Get multiple descriptors at once
const desc = JSON.parse(get_descriptors_json(mol));
console.log(`MW: ${desc.mw}, TPSA: ${desc.tpsa}, LogP: ${desc.logP}`);

// BRICS fragmentation for decomposing molecules
const frags = JSON.parse(brics_fragments_json('CC(=O)Oc1ccccc1C(=O)O'));
console.log(`Fragment count: ${frags.length}`);

// Stereoisomer enumeration
const isomers = JSON.parse(enumerate_stereo_isomers_json(parse_smiles('C(F)(Cl)Br')));
console.log(`Possible stereoisomers: ${isomers.length}`);

// Fingerprint similarity for screening
const caffeine = parse_smiles('Cn1cnc2c1c(=O)n(c(=O)n2C)C');
const aspirin = parse_smiles('CC(=O)Oc1ccccc1C(=O)O');
console.log(`Similarity: ${tanimoto_ecfp4(caffeine, aspirin).toFixed(3)}`);

More than 100 functions are exposed through the WASM API, and TypeScript definitions are generated automatically, so IDE completion works out of the box.

Rust Usage

[dependencies]
chematic = { version = "0.1.32", features = ["smiles", "fp", "chem"] }

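use chematic::fp::ecfp4;
use chematic::smiles::parse;

fn main() {
    // unwrap() panics on invalid SMILES, which is acceptable for a short demo.
    let aspirin = parse("CC(=O)Oc1ccccc1C(=O)O").unwrap();
    let caffeine = parse("Cn1cnc2c1c(=O)n(c(=O)n2C)C").unwrap();

    // Compare the two ECFP4 fingerprints with the Tanimoto coefficient.
    let similarity = ecfp4(&aspirin).tanimoto(&ecfp4(&caffeine));
    println!("Similarity: {:.3}", similarity); // ~0.4
}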
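The Tanimoto coefficient used here is the number of bits set in both fingerprints divided by the number of bits set in either. The sketch below computes it over fingerprints packed into `u64` words. The packed representation is an assumption for illustration and may differ from chematic's internal layout; returning 0.0 when both fingerprints are empty is likewise a choice made for this sketch.

```rust
/// Tanimoto coefficient of two bit vectors packed into u64 words:
/// |A AND B| / |A OR B|. Both slices must have the same length.
pub fn tanimoto(a: &[u64], b: &[u64]) -> f64 {
    assert_eq!(a.len(), b.len(), "fingerprints must have the same width");
    let mut both = 0u32;
    let mut either = 0u32;
    for (x, y) in a.iter().zip(b) {
        both += (x & y).count_ones();
        either += (x | y).count_ones();
    }
    if either == 0 {
        0.0 // no bits set in either fingerprint
    } else {
        f64::from(both) / f64::from(either)
    }
}
```

Identical non-empty fingerprints score 1.0, fingerprints with no bits in common score 0.0, and everything else falls in between.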

Testing

All 933 unit tests pass. The ChEMBL 37 validation pass over 2,897,819 molecule records also completes successfully under the parser and pipeline checks described above.

That does not just mean "the code compiles." It means the implementation has been tested against a large real-world chemical database. The reported WASM binary size is measured for the optimized core artifact after wasm-opt; compressed transfer size is smaller, while npm package size depends on generated JavaScript and TypeScript files.

chematic implements a focused set of chemical information processing features, from molecular representation to similarity calculation and experimental 3D structure generation, using pure Rust and Rust's native WASM support.

It is still under active development.

https://github.com/kent-tokyo/chematic
