Many cheminformatics tools, including RDKit and Open Babel, are written in C or C++. When they are used from JavaScript, they are often compiled through Emscripten or exposed through generated bindings. That works, but it usually brings a larger native toolchain into the build and makes package size a constant concern.
What happens if you remove all C and C++ from that stack?
That is the idea behind chematic. I wrote a cheminformatics library from scratch in pure Rust and published it as a WASM-backed npm package. The optimized core WASM artifact is about 550 KB in my release build. The Rust side is Cargo-based; npm packaging still uses the usual WASM binding and optimization steps.
Why Pure Rust?
Cheminformatics algorithms are complicated. Ring perception in molecular graphs, stereochemistry assignment, fingerprint generation: all of these rely heavily on careful graph manipulation. Existing C++ libraries are powerful and deeply validated, but I wanted this project to keep the memory-safety boundary auditable across the whole stack.
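Rust was attractive for exactly that reason: safe Rust rules out whole classes of memory errors at compile time, so the safety boundary is enforced by the compiler rather than by convention.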
Native Fit for WASM
Rust's WASM support is integrated into the standard toolchain. The wasm32-unknown-unknown target is installed with rustup target add wasm32-unknown-unknown, and after that cargo build --target wasm32-unknown-unknown is enough to produce a WASM artifact.
For the Rust crates themselves, there is no need for OS-specific native dependencies or C/C++ build scripts. The JavaScript package still has a packaging step, but the chemistry implementation does not depend on an external native library.
Memory Safety as a Constraint
chematic uses no unsafe Rust. That is not just a policy. It is a design constraint.
When working with complex data structures such as molecular graphs, I wanted the implementation to stay inside Rust's normal safety model. I relied on the ownership system and borrow checker, and implemented all operations inside safe Rust.
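One common way to keep a molecular graph inside safe Rust is to store atoms and bonds in vectors and refer to them by index instead of by pointer. The sketch below illustrates that pattern only; the type and method names (`Molecule`, `add_atom`, `add_bond`, `neighbors`) are hypothetical and are not chematic's actual API.

```rust
// Hypothetical types for illustration; not chematic's real data model.
// A crate can additionally enforce the policy with `#![forbid(unsafe_code)]`
// at its root.

#[derive(Debug, Clone, PartialEq)]
pub struct Atom {
    pub atomic_number: u8,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Bond {
    pub a: usize,
    pub b: usize,
    pub order: u8,
}

#[derive(Debug, Default)]
pub struct Molecule {
    atoms: Vec<Atom>,
    bonds: Vec<Bond>,
    // adjacency[i] holds (neighbor atom index, bond index) pairs for atom i.
    adjacency: Vec<Vec<(usize, usize)>>,
}

impl Molecule {
    /// Adds an atom and returns its index.
    pub fn add_atom(&mut self, atomic_number: u8) -> usize {
        self.atoms.push(Atom { atomic_number });
        self.adjacency.push(Vec::new());
        self.atoms.len() - 1
    }

    /// Adds a bond and returns its index, or None for a self-loop
    /// or an endpoint that does not exist.
    pub fn add_bond(&mut self, a: usize, b: usize, order: u8) -> Option<usize> {
        if a == b || a >= self.atoms.len() || b >= self.atoms.len() {
            return None;
        }
        let index = self.bonds.len();
        self.bonds.push(Bond { a, b, order });
        self.adjacency[a].push((b, index));
        self.adjacency[b].push((a, index));
        Some(index)
    }

    /// Iterates over the neighbor indices of `atom` (empty if out of range).
    pub fn neighbors(&self, atom: usize) -> impl Iterator<Item = usize> + '_ {
        self.adjacency
            .get(atom)
            .into_iter()
            .flatten()
            .map(|&(neighbor, _)| neighbor)
    }
}
```

Because every cross-reference is a plain `usize`, the borrow checker never has to reason about cyclic references, and invalid indices surface as `None` or an empty iterator rather than as undefined behavior.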
Reproducibility and Determinism
For fingerprint calculation, the same molecule must always produce the same bit vector. chematic uses fixed atom-environment ordering, stable serialization, fixed-width FNV-1a hashing, and avoids randomized hashers.
The goal is deterministic fingerprints across platforms and runtime environments, which makes similarity scores reproducible and easier to test.
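To make the "stable serialization plus fixed-width FNV-1a" idea concrete, here is a minimal sketch. The 64-bit width, the function names, and the little-endian byte layout are assumptions made for this example; the article does not specify chematic's actual width or serialization.

```rust
/// 64-bit FNV-1a over a byte slice, using the standard offset basis and prime.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte); // FNV-1a: XOR first, then multiply
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical environment hash: the caller passes neighbor identifiers
/// already sorted, and every value is written as fixed-width little-endian
/// bytes, so the result does not depend on platform endianness or on any
/// per-process random hasher seed.
pub fn hash_environment(center: u32, sorted_neighbors: &[u32]) -> u64 {
    let mut buffer = Vec::with_capacity(4 * (1 + sorted_neighbors.len()));
    buffer.extend_from_slice(&center.to_le_bytes());
    for neighbor in sorted_neighbors {
        buffer.extend_from_slice(&neighbor.to_le_bytes());
    }
    fnv1a_64(&buffer)
}
```

By contrast, Rust's default `HashMap` hasher is randomly seeded per process, which is why a fixed hash function matters for reproducible bit vectors.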
Turning Zero FFI from a Choice into a Rule
FFI to C or C++ libraries is useful, but once it is allowed, exceptions start appearing:
- "InChI needs a C library."
- "This optimization should be written in C."
- "This one dependency is too convenient to avoid."
chematic forbids FFI itself. rdkit-sys, openbabel-sys, cc, bindgen: none of them are used. That constraint keeps the implementation coherent and makes the dependency boundary easier to reason about.
The Shape of chematic
The project is organized into 13 Rust crates:
chematic/
├── chematic-core -> basic atom, bond, and molecule types; kekulization; zero dependencies
├── chematic-smiles -> OpenSMILES parser and canonical SMILES writer
├── chematic-perception -> ring perception (SSSR) and aromaticity detection
├── chematic-mol -> MOL/SDF file format reading and writing
├── chematic-depict -> 2D SVG depiction with CPK colors and highlights
├── chematic-chem -> 40+ molecular descriptors: MW, LogP, TPSA, and more
├── chematic-fp -> 7 fingerprint types, including ECFP and MACCS
├── chematic-smarts -> SMARTS parser and substructure search using VF2
├── chematic-3d -> experimental 3D coordinate generation and force-field minimization
├── chematic-rxn -> reaction SMILES/SMIRKS parser
├── chematic-wasm -> WASM bindings for JavaScript and TypeScript
├── chematic-iupac -> IUPAC name generation in pure Rust, without network access
└── chematic -> umbrella crate integrating the full set
At the moment, all 933 tests pass. I also ran a ChEMBL 37 validation pass over 2,897,819 molecule records. In that pass, the checked SMILES inputs parsed successfully and satisfied the validation checks used by the test harness.
Implementing the Algorithms
Here are some of the core chemistry-specific algorithms and how they were implemented.
Ring Perception (SSSR)
Many molecular properties depend on ring structures, such as the single ring of benzene or the two fused rings of naphthalene. Detecting those rings from a molecular graph is the job of SSSR: the Smallest Set of Smallest Rings.
A general graph library such as petgraph can compute cycle bases, but chemistry has stricter definitions of what counts as a ring. Those definitions do not always match the generic graph-theory answer.
chematic implements an algorithm specialized for chemical requirements. SSSR is not unique for every graph, so the target is not "the one true ring set." The practical goal is stable behavior that matches the selected chemistry test cases and RDKit-style expectations used by the project.
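Although the ring set itself is not unique, the number of rings in an SSSR is fixed by the graph: it equals bonds minus atoms plus connected components. The sketch below computes only that count with a small union-find; it is an illustration with a hypothetical function name, not chematic's ring-perception code, and it assumes a simple graph with valid atom indices.

```rust
/// Finds the representative of `x`, compressing the path as it goes.
fn find_root(parent: &mut [usize], mut x: usize) -> usize {
    while parent[x] != x {
        parent[x] = parent[parent[x]]; // path halving
        x = parent[x];
    }
    x
}

/// Number of rings in any SSSR of the graph:
/// bonds - atoms + connected components.
/// Panics if a bond refers to an atom index >= atom_count.
pub fn sssr_ring_count(atom_count: usize, bonds: &[(usize, usize)]) -> usize {
    let mut parent: Vec<usize> = (0..atom_count).collect();
    let mut components = atom_count;
    for &(a, b) in bonds {
        let root_a = find_root(&mut parent, a);
        let root_b = find_root(&mut parent, b);
        if root_a != root_b {
            parent[root_a] = root_b; // merge two components
            components -= 1;
        }
    }
    // Each merge consumed one bond, so this subtraction cannot underflow.
    bonds.len() + components - atom_count
}
```

A count like this is a cheap sanity check for any ring-perception routine: whatever rings it returns, there should be exactly this many.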
Kekulization
When aromatic molecules such as benzene are written in aromatic SMILES notation, explicit alternating single and double bonds must be assigned before the structure can be handled in Kekule form. This process is called kekulization. For many common aromatic systems, the implementation can be formulated as a matching problem over the molecular graph, with explicit failure handling for cases that cannot be assigned consistently.
// Input: SMILES "c1ccccc1" (benzene, aromatic notation)
// Output: "C1=CC=CC=C1" (Kekule form with alternating single and double bonds)
Rust's type system helps prevent invalid kekulization states from leaking through the API. It does not make chemistry errors impossible, but it does make illegal intermediate states harder to represent accidentally.
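The matching formulation can be sketched as a small backtracking search: choose a set of aromatic bonds so that every atom is covered by exactly one chosen bond, and report failure if no such set exists. This is a simplified illustration, not chematic's implementation. It assumes every atom passed in needs exactly one double bond (so atoms such as a pyrrole-type nitrogen would have to be excluded beforehand), the function names are hypothetical, and plain backtracking can be exponential in the worst case.

```rust
/// Returns the indices of the bonds to promote to double bonds,
/// or None if the atoms cannot all be paired up.
/// `bonds` lists aromatic bonds as atom-index pairs over 0..atom_count.
pub fn kekulize(atom_count: usize, bonds: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut matched = vec![false; atom_count];
    let mut chosen = Vec::new();
    if assign(&mut matched, bonds, &mut chosen) {
        chosen.sort_unstable();
        Some(chosen)
    } else {
        None
    }
}

fn assign(matched: &mut [bool], bonds: &[(usize, usize)], chosen: &mut Vec<usize>) -> bool {
    // Take the lowest-index atom that still lacks a double bond.
    let atom = match matched.iter().position(|&m| !m) {
        Some(atom) => atom,
        None => return true, // every atom is paired: success
    };
    for (index, &(a, b)) in bonds.iter().enumerate() {
        let other = if a == atom {
            b
        } else if b == atom {
            a
        } else {
            continue; // bond does not touch this atom
        };
        if other == atom || matched[other] {
            continue;
        }
        matched[atom] = true;
        matched[other] = true;
        chosen.push(index);
        if assign(matched, bonds, chosen) {
            return true;
        }
        // Undo the choice and try the next bond.
        chosen.pop();
        matched[atom] = false;
        matched[other] = false;
    }
    false
}
```

For a six-membered ring listed as bonds 0-1, 1-2, 2-3, 3-4, 4-5, 5-0, the search selects bonds 0, 2, and 4, which corresponds to the C1=CC=CC=C1 form shown above. A five-membered ring in which every atom demands a double bond has no perfect matching, so the function returns `None` instead of a half-assigned structure.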
CIP Stereochemistry
CIP rules, the Cahn-Ingold-Prelog rules, determine R/S configuration for chiral centers and E/Z configuration for double bonds. The priority assignment algorithm is complex: each atom receives a priority, that priority can depend on its neighbors, and the effect propagates recursively through the graph.
The current implementation covers the subset needed by the library's stereochemistry tests, including common tetrahedral and double-bond cases. More obscure CIP edge cases, such as advanced pseudoasymmetry handling, should be treated as an area for continued validation rather than a solved claim.
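To give a feel for the recursive comparison, the sketch below compares two substituents sphere by sphere using atomic numbers only, which is the core of CIP Rule 1. It is a deliberately reduced illustration with a hypothetical function name and input format, not chematic's algorithm: it ignores duplicate atoms for multiple bonds, isotopes, and stereodescriptor-based rules, and flattening a whole sphere into one sorted list is only a reasonable approximation for the first two spheres.

```rust
use std::cmp::Ordering;

/// Compares two substituents by atomic numbers, sphere by sphere.
/// `a[0]` holds the atomic number of the atom attached to the stereocenter,
/// `a[1]` the atomic numbers of that atom's further neighbors, and so on.
/// Hydrogens must be listed explicitly. Ordering::Greater means `a` has
/// higher priority; Ordering::Equal means this sketch cannot decide.
pub fn compare_substituents(a: &[Vec<u8>], b: &[Vec<u8>]) -> Ordering {
    let depth = a.len().max(b.len());
    for sphere in 0..depth {
        let mut set_a = a.get(sphere).cloned().unwrap_or_default();
        let mut set_b = b.get(sphere).cloned().unwrap_or_default();
        // Highest atomic number first, then compare at the first difference.
        set_a.sort_unstable_by(|x, y| y.cmp(x));
        set_b.sort_unstable_by(|x, y| y.cmp(x));
        match set_a.cmp(&set_b) {
            Ordering::Equal => continue, // tie: look one sphere further out
            decided => return decided,
        }
    }
    Ordering::Equal
}
```

With this input format, ethyl `[[6], [6, 1, 1]]` outranks methyl `[[6], [1, 1, 1]]`: the first sphere ties on carbon, and the second sphere decides because carbon beats hydrogen.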
ECFP Fingerprints (Morgan Algorithm)
ECFP fingerprints convert the local environment around atoms into hashed identifiers, which are then used to calculate molecular similarity. The algorithm runs for multiple rounds, expanding each atom's neighborhood information and hashing it at every step.
chematic canonicalizes input order before calculation and uses a fixed-width FNV-1a hash. The important property is not the hash function alone, but the combination of stable ordering, stable serialization, and no randomized hashing. As a result, the same molecule should produce the same bit vector every time.
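The round-by-round update can be sketched as follows. This is a simplified Morgan-style iteration under stated assumptions, not chematic's ECFP code: the function names are hypothetical, bond orders and duplicate-substructure removal are left out, neighbor identifiers are sorted to get order independence, and a 64-bit FNV-1a is assumed as the hash.

```rust
/// FNV-1a (64-bit) over a sequence of u64 words, serialized little-endian.
fn hash_words(words: &[u64]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for word in words {
        for byte in word.to_le_bytes() {
            hash ^= u64::from(byte);
            hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
        }
    }
    hash
}

/// Returns the atom identifiers for rounds 0..=radius.
/// `initial[i]` is the starting identifier of atom i and
/// `adjacency[i]` lists the neighbor indices of atom i.
pub fn morgan_rounds(initial: &[u64], adjacency: &[Vec<usize>], radius: usize) -> Vec<Vec<u64>> {
    let mut rounds = vec![initial.to_vec()];
    for round in 1..=radius {
        let previous = &rounds[round - 1];
        let next: Vec<u64> = (0..previous.len())
            .map(|atom| {
                let mut neighbor_ids: Vec<u64> =
                    adjacency[atom].iter().map(|&n| previous[n]).collect();
                // Sorting removes any dependence on neighbor listing order.
                neighbor_ids.sort_unstable();
                let mut words = vec![round as u64, previous[atom]];
                words.extend(neighbor_ids);
                hash_words(&words)
            })
            .collect();
        rounds.push(next);
    }
    rounds
}

/// Folds all identifiers into a bit vector of `n_bits` bits (n_bits > 0).
pub fn fold(rounds: &[Vec<u64>], n_bits: usize) -> Vec<bool> {
    let mut bits = vec![false; n_bits];
    for id in rounds.iter().flatten() {
        bits[(*id % n_bits as u64) as usize] = true;
    }
    bits
}
```

Listing an atom's neighbors in a different order yields the same identifiers, which is the property that keeps the final bit vector stable across input orderings.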
Scope and Packaging Tradeoffs
This is not a replacement for RDKit. RDKit has decades of chemistry validation and a much broader feature set. The tradeoff in chematic is narrower scope in exchange for a pure-Rust implementation and a smaller WASM-oriented package.
The numbers below are intended as project-level packaging context, not a benchmark. WASM and npm sizes vary depending on build options, exported APIs, JavaScript glue, compression, and optimization settings.
| Item | chematic | RDKit.js / Open Babel style stacks | OCL.js / Indigo-style stacks |
|---|---|---|---|
| Native dependency in the chemistry core | None | Usually yes | Depends on project |
| WASM/package size profile | Small optimized core artifact in my build | Often larger because mature native libraries expose broad functionality | Usually depends heavily on exported feature set |
| Build model | Cargo-based Rust crates plus WASM packaging | Native build toolchain plus generated bindings | Project-specific |
| Feature coverage | Focused subset implemented in Rust | Much broader, especially RDKit | Broader in some areas, different scope in others |
| InChI/InChIKey | No, out of scope | Often available | Often available |
The standard/reference InChI implementation is written in C. I chose not to wrap it because that would violate the no-FFI constraint. chematic documents InChI/InChIKey support as out of scope.
Instead, it covers many use cases through canonical SMILES, Murcko scaffolds, and molecular graph isomorphism.
Use Cases
This library is useful in places where a lightweight browser-side cheminformatics implementation is more important than full RDKit parity:
- Browser-side chemical database filtering: fingerprint similarity and descriptor filters that can run directly on an end user's machine.
- Prototype screening UIs: Lipinski rules, QED, SA score, and other descriptors calculated from a web interface without a backend round trip.
- Teaching and visualization workflows: molecule parsing, depiction, and experimental 3D generation for interactive demos.
- Lightweight SAR exploration: extracting simple chemical-change patterns in desktop or browser applications.
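As an example of the kind of client-side filter mentioned above, the sketch below counts Lipinski rule-of-five violations from precomputed descriptor values. The `Descriptors` struct and function names are hypothetical and do not mirror chematic's API; in practice the inputs would come from the library's descriptor functions.

```rust
/// Hypothetical container for the four descriptors the rule needs.
pub struct Descriptors {
    pub molecular_weight: f64,
    pub log_p: f64,
    pub h_bond_donors: u32,
    pub h_bond_acceptors: u32,
}

/// Counts how many of Lipinski's four criteria are violated:
/// MW <= 500, logP <= 5, H-bond donors <= 5, H-bond acceptors <= 10.
pub fn lipinski_violations(d: &Descriptors) -> u32 {
    let checks = [
        d.molecular_weight > 500.0,
        d.log_p > 5.0,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ];
    checks.iter().filter(|&&violated| violated).count() as u32
}

/// A common convention treats at most one violation as passing.
pub fn passes_lipinski(d: &Descriptors) -> bool {
    lipinski_violations(d) <= 1
}
```

A filter like this is cheap enough to run over every row of a compound table in the browser, with no server round trip.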
Development Progress
chematic was developed in phases, from Phase 1 for the foundation to Phase 15 for extended functionality. It is now at v0.1.32 and still actively maintained.
- Phase 1-6: core functionality: parsing, ring perception, descriptors, fingerprints, and RDKit compatibility
- Phase 7-9: extended descriptors and diversity selection: EState, VSA, SA score, MaxMin, and Butina
- Phase 10-15: mutable APIs, 2D stereochemistry, reaction formats, and IUPAC naming
At each stage, I repeated compatibility testing against ChEMBL. The ChEMBL 37 validation pass covered 2,897,819 molecule records and checked that the input structures used by the harness could be parsed and processed without failing those validation checks. This should be read as parser and pipeline validation, not as proof that every descriptor or stereochemistry result is scientifically equivalent to RDKit.
Limitations
The main limitation is scope. chematic is designed around a no-FFI pure-Rust constraint, so it intentionally does not expose everything available in mature cheminformatics systems.
- InChI and InChIKey are out of scope because wrapping the reference implementation would require FFI.
- 3D coordinate generation and force-field minimization are useful for demos and exploratory workflows, but they should not be treated as production-grade molecular modeling validation.
- CIP stereochemistry support covers the common cases tested by the project, but rare edge cases need more validation.
- RDKit compatibility is a testing target for selected behavior, not a claim of full RDKit equivalence.
Current Status and Demo
Live Demo
Descriptor calculation, fingerprint similarity, drug-likeness rules, a 3D viewer, reaction schemes, and more run in the browser through WASM:
https://kent-tokyo.github.io/chematic/
JavaScript and TypeScript Usage
npm install @kent-tokyo/chematic
import init, {
  parse_smiles, get_descriptors_json, brics_fragments_json,
  enumerate_stereo_isomers_json, tanimoto_ecfp4
} from '@kent-tokyo/chematic';
await init();
const mol = parse_smiles('CC(=O)Oc1ccccc1C(=O)O'); // aspirin
console.log(mol.molecular_weight()); // ~180.16
console.log(mol.qed()); // drug-likeness [0,1]
// Get multiple descriptors at once
const desc = JSON.parse(get_descriptors_json(mol));
console.log(`MW: ${desc.mw}, TPSA: ${desc.tpsa}, LogP: ${desc.logP}`);
// BRICS fragmentation for decomposing molecules
const frags = JSON.parse(brics_fragments_json('CC(=O)Oc1ccccc1C(=O)O'));
console.log(`Fragment count: ${frags.length}`);
// Stereoisomer enumeration
const isomers = JSON.parse(enumerate_stereo_isomers_json(parse_smiles('C(F)(Cl)Br')));
console.log(`Possible stereoisomers: ${isomers.length}`);
// Fingerprint similarity for screening
const caffeine = parse_smiles('Cn1cnc2c1c(=O)n(c(=O)n2C)C');
const aspirin = parse_smiles('CC(=O)Oc1ccccc1C(=O)O');
console.log(`Similarity: ${tanimoto_ecfp4(caffeine, aspirin).toFixed(3)}`);
More than 100 functions are exported through the WASM API, and TypeScript definitions are generated automatically, so IDE completion works out of the box.
Rust Usage
[dependencies]
chematic = { version = "0.1.32", features = ["smiles", "fp", "chem"] }
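use chematic::fp::ecfp4;
use chematic::smiles::parse;

fn main() {
    // unwrap() panics on invalid SMILES; handle the error properly in real code.
    let aspirin = parse("CC(=O)Oc1ccccc1C(=O)O").unwrap();
    let caffeine = parse("Cn1cnc2c1c(=O)n(c(=O)n2C)C").unwrap();
    // Tanimoto similarity between the two ECFP4 fingerprints.
    let similarity = ecfp4(&aspirin).tanimoto(&ecfp4(&caffeine));
    println!("Similarity: {:.3}", similarity); // ~0.4
}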
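For readers unfamiliar with the score itself: the Tanimoto coefficient of two bit vectors is the number of bits set in both divided by the number of bits set in either. The standalone sketch below computes it over fingerprints packed into `u64` words. It is an illustration with a hypothetical function signature, not chematic's `tanimoto` method, and returning 0.0 for two all-zero fingerprints is a choice made here; conventions differ.

```rust
/// Tanimoto coefficient of two bit vectors packed into u64 words.
/// Returns None if the fingerprints have different lengths.
pub fn tanimoto(a: &[u64], b: &[u64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let mut in_both: u32 = 0;
    let mut in_either: u32 = 0;
    for (x, y) in a.iter().zip(b) {
        in_both += (x & y).count_ones();
        in_either += (x | y).count_ones();
    }
    if in_either == 0 {
        // Both fingerprints are empty; this sketch reports 0.0.
        return Some(0.0);
    }
    Some(f64::from(in_both) / f64::from(in_either))
}
```

Because the computation reduces to bitwise AND, OR, and popcount, it is cheap to run in WASM over large fingerprint collections.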
Testing
All 933 unit tests pass. The ChEMBL 37 validation pass over 2,897,819 molecule records also completes successfully under the parser and pipeline checks described above.
That does not just mean "the code compiles." It means the implementation has been tested against a large real-world chemical database. The reported WASM binary size is measured for the optimized core artifact after wasm-opt; compressed transfer size is smaller, while npm package size depends on generated JavaScript and TypeScript files.
chematic implements a focused set of chemical information processing features, from molecular representation to similarity calculation and experimental 3D structure generation, using pure Rust and Rust's native WASM support.
It is still under active development.