DEV Community

kent-tokyo

Building a Computer-Aided Synthesis Planning Engine in Pure Rust

I've been building renkin, a retrosynthesis engine in pure Rust. You give it a target molecule as a SMILES string, and it searches for synthesis routes back to commercially available starting materials.

https://github.com/kent-tokyo/renkin

Background

Retrosynthesis is how organic chemists plan a synthesis: instead of asking "how do I make this?", you ask "what reaction could have produced this, and where do those precursors come from?" — working backwards from the target until you reach things you can actually buy. renkin automates that search.

SMILES is a text notation for molecular structure. Aspirin is CC(=O)Oc1ccccc1C(=O)O. That's what you pass in.

Search design

The straightforward approach, trying every applicable reaction rule at every intermediate, runs into combinatorial explosion fast. A single molecule can match hundreds of rules, and each rule applies recursively to every precursor it generates, so the search space grows exponentially with depth. I needed something smarter.

AND-OR tree

Retrosynthesis search has a structure that doesn't fit standard graph search well. At any step you can choose between reactions (either A or B works, an OR), but each reaction requires all of its precursors simultaneously (an AND). Standard graph search treats every successor as an alternative, which breaks the cost accounting: a route costs the cheapest option at an OR node but the sum of all branches at an AND node. renkin models the space as an AND-OR tree and searches it accordingly.
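
The cost accounting can be sketched in a few lines. This is a minimal illustration, not renkin's actual code: the `Reaction` and `Molecule` classes, the unit step cost, and the placeholder molecule names are all assumptions made for the example. A molecule (OR node) takes its cheapest reaction; a reaction (AND node) pays for every precursor.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from math import inf


@dataclass
class Reaction:
    """AND node: every precursor has to be solved."""
    precursors: list[Molecule]
    step_cost: float = 1.0  # assumption: each reaction step costs 1


@dataclass
class Molecule:
    """OR node: any single reaction is enough."""
    name: str
    purchasable: bool = False
    reactions: list[Reaction] = field(default_factory=list)


def route_cost(mol: Molecule) -> float:
    """Cheapest cost to reach purchasable materials, or inf if unsolvable."""
    if mol.purchasable:
        return 0.0
    # OR: minimum over alternative reactions; AND: sum over all precursors.
    return min(
        (rxn.step_cost + sum(route_cost(p) for p in rxn.precursors)
         for rxn in mol.reactions),
        default=inf,
    )


def buyable(name: str) -> Molecule:
    return Molecule(name, purchasable=True)


# Hypothetical tree with placeholder names (not real chemistry).
intermediate = Molecule("I", reactions=[Reaction([buyable("C")])])
dead_end = Molecule("D")  # not purchasable and no known reaction
target = Molecule("T", reactions=[
    Reaction([buyable("A"), intermediate]),  # 1 + (0 + 1) = 2
    Reaction([dead_end]),                    # 1 + inf = inf
])

print(route_cost(target))  # 2.0
```

The second reaction looks shorter on paper (one precursor instead of two), but its single precursor is a dead end, so the AND node makes the whole branch unsolvable.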

A* with SA Score

For the A* heuristic, I use the SA Score (Synthetic Accessibility Score) — a 1–10 number for how synthetically accessible a molecule is, where lower is easier. The idea is that lower SA Score intermediates are more likely to show up in building block catalogs, so steering the search in that direction tends to find better routes. It worked reasonably well in practice.
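
To make the steering concrete, here is a small sketch of how an A*-style priority (cost so far plus a heuristic) orders a frontier. It is not renkin's implementation: the SA Score values, the placeholder molecule names, and the `weight` parameter are invented for illustration, and real scores would come from an SA Score calculator.

```python
import heapq

# Hypothetical SA Scores (1 = easy to make, 10 = hard); values are made up.
SA_SCORE = {"easy": 1.8, "medium": 3.5, "hard": 6.2}


def priority(steps_so_far: int, sa_score: float, weight: float = 1.0) -> float:
    """f = g + h: steps already taken plus an SA-Score-based estimate."""
    return steps_so_far + weight * sa_score


frontier: list[tuple[float, str]] = []
for name, depth in [("hard", 1), ("easy", 2), ("medium", 1)]:
    heapq.heappush(frontier, (priority(depth, SA_SCORE[name]), name))

# Pop in priority order: lowest f first.
order = [heapq.heappop(frontier)[1] for _ in range(len(frontier))]
print(order)  # ['easy', 'medium', 'hard']
```

Note that "easy" is expanded first even though it sits one step deeper, because its low SA Score outweighs the extra step.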

Beam search

For large molecules, even the AND-OR + A* combination can get out of hand. Beam search caps the candidates kept per step at N, which makes the computation predictable at the cost of completeness: a route whose early intermediates score poorly can be pruned before it pays off.
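
The pruning step itself is small. The sketch below assumes each candidate is a `(score, name)` pair where a lower score is more promising; the function name, scores, and labels are hypothetical.

```python
import heapq


def prune_to_beam(candidates, beam_width):
    """Keep only the beam_width most promising candidates (lowest score)."""
    return heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])


candidates = [(4.5, "c1"), (3.8, "c2"), (7.2, "c3"), (5.0, "c4")]
kept = prune_to_beam(candidates, beam_width=2)
print(kept)  # [(3.8, 'c2'), (4.5, 'c1')]
```

Everything outside the beam is dropped for good, which is where the lost routes come from: if `c3` was the only candidate leading to a purchasable precursor, the search never finds out.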

Reaction rules

Rules are written in SMARTS (a pattern language for chemical structures). The current set has 314 rules:

  • 31: hand-crafted rules for the most common reaction types — amide bond formation, esterification, Suzuki coupling, and similar
  • 283: automatically extracted from the USPTO reaction database using rdchiral

The hand-crafted ones tend to be cleaner but don't cover much ground. The auto-extracted ones add coverage but come with noise. Template frequency weighting, which gives higher priority to rules that appear more often in USPTO, turned out to be the biggest single factor in the solve rate.
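
One way to turn template frequency into a priority is a negative log of the relative frequency, so that common templates get a low cost. This is an assumption about how such a weighting could look, not a description of renkin's formula, and the template names and counts below are invented.

```python
from math import log

# Hypothetical occurrence counts per template; not real USPTO statistics.
template_counts = {
    "amide_coupling": 1200,
    "suzuki_coupling": 450,
    "rare_rearrangement": 3,
}
total = sum(template_counts.values())


def template_cost(name: str) -> float:
    """Negative log of relative frequency: frequent templates cost less."""
    return -log(template_counts[name] / total)


# Try the most frequent (cheapest) templates first.
ranked = sorted(template_counts, key=template_cost)
print(ranked)  # ['amide_coupling', 'suzuki_coupling', 'rare_rearrangement']
```

A cost in this form can be added to the per-step cost in the search, so a rare, noisy template is still usable but only wins when nothing more common applies.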

Benchmarks

USPTO-50k (4,907-molecule test set) is the standard evaluation for retrosynthesis tools. Here's how the numbers changed as I added each piece:

| Configuration | Solved | Rate | Rules | Depth | Beam |
|---|---|---|---|---|---|
| v0.1.0 initial (hand-crafted only) | 366/4907 | 7.5% | 31 | 3 | 50 |
| + auto templates (top-300) | 1363/4907 | 27.8% | 222 | 3 | 50 |
| + depth=5, top-500 templates | 2315/4907 | 47.2% | 314 | 5 | 50 |
| + beam=100 | 2688/4907 | 54.8% | 314 | 5 | 100 |
| + template frequency weighting | ~3484/4907 | ~71% | 314 | 5 | 100 |

The ~71% in the last row was measured on a 100-molecule sample, not the full 4,907, so the ~3484 count is extrapolated from that rate. Take it as a directional figure.
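
To put a rough number on "directional", the snippet below recomputes the full-set rates from the table and then estimates the uncertainty of the sampled figure. It assumes that ~71% means 71 of 100 molecules solved and that the 100 were drawn at random; neither is stated in the results above. The interval is a plain normal-approximation (Wald) interval.

```python
from math import sqrt

TEST_SET = 4907
solved = {
    "hand-crafted only": 366,
    "+ auto templates": 1363,
    "+ depth=5, top-500": 2315,
    "+ beam=100": 2688,
}
# Solve rates in percent, rounded to one decimal as in the table.
rates = {name: round(100 * count / TEST_SET, 1) for name, count in solved.items()}
print(rates)

# Assumption: 71 of 100 randomly sampled molecules were solved.
p, n = 0.71, 100
half_width = 1.96 * sqrt(p * (1 - p) / n)  # 95% normal-approximation interval
low, high = p - half_width, p + half_width
print(f"{low:.1%} to {high:.1%}")  # roughly 62% to 80%
```

Under those assumptions the full-set rate could plausibly land anywhere from the low 60s to about 80%, which is why the last row is better read as a trend than as a benchmark result.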

Comparison with other tools (same train/test split):

| Tool | Method | USPTO-50k |
|---|---|---|
| renkin | A* + AND-OR tree | ~71% (approx.)† |
| GLG | | 58.0% |
| LocalRetro | Neural network | 53.4% |
| AiZynthFinder | MCTS | 45–53% |
| Retro* | AND-OR tree search | 44.3% |
| ASKCOS | MCTS | 41% |

† renkin's figure is from a 100-molecule sample; other tools used the full 4,907. This comparison still needs more work — I haven't verified whether the number holds at full scale.

The jump from template frequency weighting alone was larger than I expected. It's the thing I'd add first if starting over.

Why Pure Rust

renkin is built on chematic, a pure Rust cheminformatics library I wrote earlier.

https://github.com/kent-tokyo/chematic

That means SMARTS matching, molecular graph operations, and SA Score calculation are all in safe Rust, no FFI. cargo build is enough, and it compiles to WebAssembly (~500 KB). For parallel rule application, renkin uses rayon — including a WASM-compatible build that runs through Web Workers, though that path hasn't had as much testing yet.

Usage

Python

pip install renkin

import renkin

result = renkin.find_routes(
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    depth=5,
    max_routes=3,
)

for route in result["routes"]:
    for step in route["steps"]:
        print(f"  {step['target']} -> {' + '.join(step['precursors'])}  [{step['rule']}]")
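
If you only want one route, a small helper over the same result shape can pick it. The sketch below runs on a hand-written stand-in for the result dict (same keys as in the loop above, invented values), so it does not call renkin; whether the real results already come back sorted is not something this example assumes.

```python
def shortest_route(result):
    """Return the route with the fewest steps, or None if there are no routes."""
    return min(result["routes"], key=lambda route: len(route["steps"]), default=None)


# Stand-in for a result dict; molecule and rule names are placeholders.
mock_result = {
    "routes": [
        {"steps": [
            {"target": "T", "precursors": ["A", "B"], "rule": "r1"},
            {"target": "B", "precursors": ["C"], "rule": "r2"},
        ]},
        {"steps": [
            {"target": "T", "precursors": ["D", "E"], "rule": "r3"},
        ]},
    ]
}

best = shortest_route(mock_result)
print(len(best["steps"]), best["steps"][0]["rule"])  # 1 r3
```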

CLI (Rust)

cargo install renkin

renkin --target "CC(=O)Oc1ccccc1C(=O)O" --depth 5 \
    --templates data/templates_extracted.smi

JavaScript / Node.js

npm install renkin

import init, { find_routes } from 'renkin';

await init();
const result = JSON.parse(find_routes("CC(=O)Oc1ccccc1C(=O)O", 5, 3, 0));

What's left

314 rules aren't enough for complex molecules like natural products — success rates drop there. I want to try pulling more templates from sources beyond USPTO.

Scoring routes by step count, yield, and cost (rather than just solved/not-solved) is also on the list. And a browser UI for stepping through the AND-OR tree is in progress.


Retrosynthesis engine "renkin":
https://github.com/kent-tokyo/renkin

The cheminformatics library underneath, "chematic":
https://github.com/kent-tokyo/chematic
