kent-tokyo

Posted on Jun 27

How you pass molecules to an LLM matters: features I built into my Rust cheminformatics library after reading recent arXiv papers

#rust #chemistry #cheminformatics #llm

Introduction

Simply passing a SMILES string into an LLM prompt is not enough to make the model reason correctly about molecular structure.

Take aspirin: its SMILES is CC(=O)Oc1ccccc1C(=O)O. A chemist can read it, but an LLM has to parse which atom is bonded to which from a flat string — putting it at a disadvantage on structure-understanding tasks.

Reading recent arXiv papers, three directions emerge for addressing this:

Better molecular representations: use a format more explicit than SMILES
Richer context: pass property data and similar molecules alongside the structure
Role separation: let LLMs judge and explain; hand calculations to deterministic tools

I've been building chematic, a cheminformatics library written in pure Rust with Python, WASM, and MCP support. It handles molecule parsing, property calculation, similarity search, 3D generation, and more. This article covers the features I added with these three directions in mind.

1. ChemicalJSON: explicit graph representation

Motivation

The arXiv paper "Molecular Representations for Large Language Models" (May 2026) asks what a "good" molecular representation looks like for an LLM.

It introduces MolJSON and benchmarks it against five common formats — SMILES, IUPAC name, InChI, and others. The results show that explicit graph representations outperform compressed string formats on structure-understanding tasks. On a shortest-path reasoning benchmark with GPT-5, MolJSON reached 98.5% accuracy vs. 92.2% for SMILES and 82.7% for IUPAC names.

SMILES compresses molecular structure into a string, so an LLM must implicitly parse "which atom bonds to which." An explicit JSON with an atom list and bond list makes that structure directly readable.

Implementation

import chematic

mol = chematic.from_smiles("c1ccccc1")  # benzene

# convert to explicit graph JSON
cjson = mol.to_cjson()

# reconstruct from JSON
mol2 = chematic.from_cjson(cjson)

The output looks like this:

{
  "atoms": [
    {"index": 0, "element": "C", "aromatic": true},
    {"index": 1, "element": "C", "aromatic": true},
    {"index": 2, "element": "C", "aromatic": true},
    {"index": 3, "element": "C", "aromatic": true},
    {"index": 4, "element": "C", "aromatic": true},
    {"index": 5, "element": "C", "aromatic": true}
  ],
  "bonds": [
    {"begin": 0, "end": 1, "order": "aromatic"},
    {"begin": 1, "end": 2, "order": "aromatic"},
    {"begin": 2, "end": 3, "order": "aromatic"},
    {"begin": 3, "end": 4, "order": "aromatic"},
    {"begin": 4, "end": 5, "order": "aromatic"},
    {"begin": 5, "end": 0, "order": "aromatic"}
  ]
}

The fact that c1ccccc1 represents six carbons in a ring is implicit in the string but explicit in the JSON. LLMs and AI agents can reference the structure directly in this form.
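To make the "explicit graph" point concrete, here is a minimal sketch of how a consumer could answer a shortest-path question straight from the atom and bond lists. It uses only the Python standard library and the field names from the output above (`index`, `begin`, `end`). The helper names `build_adjacency` and `shortest_path_length` are hypothetical and not part of chematic.

```python
from collections import deque

# Benzene in the atom/bond layout shown above (same field names).
benzene = {
    "atoms": [{"index": i, "element": "C", "aromatic": True} for i in range(6)],
    "bonds": [
        {"begin": i, "end": (i + 1) % 6, "order": "aromatic"} for i in range(6)
    ],
}


def build_adjacency(cjson):
    """Turn the atom and bond lists into an undirected adjacency map."""
    adjacency = {atom["index"]: [] for atom in cjson["atoms"]}
    for bond in cjson["bonds"]:
        adjacency[bond["begin"]].append(bond["end"])
        adjacency[bond["end"]].append(bond["begin"])
    return adjacency


def shortest_path_length(adjacency, start, goal):
    """Breadth-first search; returns the number of bonds, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        atom, distance = queue.popleft()
        if atom == goal:
            return distance
        for neighbor in adjacency[atom]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


adjacency = build_adjacency(benzene)
print(shortest_path_length(adjacency, 0, 3))  # atoms 0 and 3 sit opposite each other in the ring
```

No SMILES parsing is needed here: the bond list already is the edge list, which is the property the benchmark result above rewards.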

chematic implements ChemicalJSON rather than MolJSON itself, but the intent is the same.

2. Molecule context: describe / review / report

Motivation

The arXiv paper "MolE-RAG" proposes a way to improve LLM-based molecular property prediction. Instead of passing only a SMILES string, it passes property descriptors, structural alerts, and structurally similar known compounds as inference-time context, applying the RAG (Retrieval-Augmented Generation) idea to molecular reasoning. The paper reports up to a 28-point ROC-AUC improvement over a SMILES-only baseline.

Implementation

mol.review() returns a text summary of a molecule:

mol = chematic.from_smiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
print(mol.review())
# MW: 180.2, LogP: 1.31, TPSA: 63.6, HBD: 1, HBA: 3
# Rotatable bonds: 3, Aromatic rings: 1
# Drug-likeness: Lipinski pass
# Alerts: none (PAINS, Brenk)
# QED: 0.56

Field glossary:

MW (molecular weight) / LogP (lipophilicity) / TPSA (topological polar surface area) / HBD/HBA (hydrogen bond donors/acceptors): physicochemical properties
Lipinski pass: whether the molecule meets Lipinski's Rule of Five for oral bioavailability
PAINS / Brenk: structural alert filters that flag substructures known to cause false positives in drug discovery screens
QED (Quantitative Estimate of Drug-likeness): drug-likeness score from 0 to 1
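For readers unfamiliar with the "Lipinski pass" line, the check behind it can be sketched in a few lines. This uses the textbook Rule of Five thresholds (MW at most 500, LogP at most 5, at most 5 hydrogen bond donors, at most 10 acceptors) and the common convention of tolerating one violation. How chematic itself implements the check is not stated here, so the `Properties` class, the function names, and the one-violation allowance are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Properties:
    mw: float    # molecular weight
    logp: float  # lipophilicity
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors


def lipinski_violations(p):
    """Return the list of Rule of Five criteria that the molecule breaks."""
    violations = []
    if p.mw > 500:
        violations.append("MW > 500")
    if p.logp > 5:
        violations.append("LogP > 5")
    if p.hbd > 5:
        violations.append("HBD > 5")
    if p.hba > 10:
        violations.append("HBA > 10")
    return violations


def passes_lipinski(p, max_violations=1):
    """Assumed convention: a molecule passes with at most one violation."""
    return len(lipinski_violations(p)) <= max_violations


# Values taken from the aspirin summary above.
aspirin = Properties(mw=180.2, logp=1.31, hbd=1, hba=3)
print(lipinski_violations(aspirin), passes_lipinski(aspirin))

# Hypothetical values, chosen only to show a failing case.
too_big = Properties(mw=650.0, logp=6.2, hbd=2, hba=8)
print(lipinski_violations(too_big), passes_lipinski(too_big))
```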

mol.describe() returns a more detailed text description. chematic.molecule_report(mol) generates an HTML report.

Embedding in an LLM prompt

Including this summary in the prompt lets the LLM reference real property values while generating a response:

mol = chematic.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
context = mol.review()

prompt = f"""
Answer the question below about the following molecule.

Molecule info:
{context}

Question: Evaluate the oral absorption potential of this compound.
"""

Compared to passing only SMILES, the LLM can now reason explicitly: "LogP 1.31 suggests moderate lipophilicity" and "Lipinski pass indicates likely oral absorption." This is the library-side counterpart to MolE-RAG's "what to pass" question.

3. ECFP4 / Tanimoto / LSH: similarity search

Motivation

To implement MolE-RAG's idea of "add structurally similar molecules to context," you need fingerprint computation and similarity search.

A molecular fingerprint encodes a molecule's substructures as a bit vector. ECFP4 is one of the most widely used fingerprints: it hashes the circular neighborhood within two bonds of each atom (a diameter of 4, hence the name). Molecules with similar structures produce similar fingerprints.

Implementation

aspirin = chematic.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
ibuprofen = chematic.from_smiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O")

# compute fingerprints and similarity
fp1 = aspirin.ecfp4()
fp2 = ibuprofen.ecfp4()
sim = chematic.tanimoto(fp1, fp2)  # Tanimoto coefficient: 0.0 (no shared bits) to 1.0 (identical fingerprints)

# pairwise Tanimoto matrix for a collection (parallelized in Rust)
mols = [aspirin, ibuprofen]  # any list of parsed molecules
matrix = chematic.bulk.tanimoto_matrix(mols)

# approximate nearest-neighbor search with LSH (Locality Sensitive Hashing)
index = chematic.SimilarityIndex(mols)
similar = index.query(aspirin, k=10)  # 10 molecules most similar to aspirin

The Tanimoto coefficient, the number of bits set in both fingerprints divided by the number set in either, is the most common fingerprint similarity metric in drug discovery. LSH enables fast approximate nearest-neighbor search over large compound libraries.
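As a rough illustration of both ideas, the sketch below computes Tanimoto on fingerprints stored as sets of "on" bit positions, then estimates the same quantity with MinHash signatures, a building block commonly used for LSH over set similarity. Whether chematic's index uses MinHash internally is not stated in this article, so treat this as a generic technique rather than the library's algorithm. All function names and the toy bit positions are made up for the example.

```python
import random

PRIME = (1 << 61) - 1  # large prime; bit positions must be smaller than this


def tanimoto(a, b):
    """Shared on-bits divided by on-bits present in either fingerprint."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(a | b)


def make_hashes(count, seed=0):
    """Random (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]


def minhash_signature(bits, hashes):
    """For each hash function, keep the smallest hash over the on-bits (bits must be non-empty)."""
    return [min((a * x + b) % PRIME for x in bits) for a, b in hashes]


def estimate_similarity(sig1, sig2):
    """The fraction of matching signature slots approximates the Tanimoto coefficient."""
    matches = sum(1 for u, v in zip(sig1, sig2) if u == v)
    return matches / len(sig1)


# Toy fingerprints (sets of on-bit positions), not real ECFP4 output.
fp_a = {1, 5, 9, 12, 40}
fp_b = {1, 5, 9, 77}

hashes = make_hashes(128)
sig_a = minhash_signature(fp_a, hashes)
sig_b = minhash_signature(fp_b, hashes)

print(tanimoto(fp_a, fp_b))               # exact value: 3 shared bits out of 6 total
print(estimate_similarity(sig_a, sig_b))  # approximation from the signatures
```

An LSH index would then split each signature into bands and bucket molecules by band, so that only molecules sharing a bucket are compared exactly.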

Using it in a RAG pipeline

Similarity search can serve as the retrieval step in a RAG pipeline, adding known compounds similar to the query molecule as context:

# build an index from a known compound library
# (known_compounds is your own list of SMILES strings)
library = [chematic.from_smiles(s) for s in known_compounds]
index = chematic.SimilarityIndex(library)

# retrieve similar compounds and build context
query = chematic.from_smiles("CC(=O)Oc1ccccc1C(=O)O")
similar = index.query(query, k=5)

context = "\n".join([
    f"Similar compound {i+1}:\n{mol.review()}"
    for i, mol in enumerate(similar)
])

ECFP4 computation, Tanimoto matrices, and LSH indexing all run on the Rust side, so calling from Python stays practical even for libraries of several thousand molecules.
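The retrieved summaries still have to be merged with the query molecule's own summary before they reach the model. One possible way to assemble that prompt is sketched below. It works on plain strings so it runs without chematic; `build_rag_prompt` is a hypothetical helper, and the neighbor summaries are placeholders standing in for real `review()` output.

```python
def build_rag_prompt(query_summary, neighbor_summaries, question):
    """Combine the query molecule's summary, retrieved neighbors, and the question."""
    sections = [
        "Answer the question below about the following molecule.",
        "",
        "Molecule info:",
        query_summary.strip(),
    ]
    if neighbor_summaries:
        sections += ["", "Structurally similar known compounds:"]
        for i, summary in enumerate(neighbor_summaries, start=1):
            sections += [f"Similar compound {i}:", summary.strip()]
    sections += ["", f"Question: {question}"]
    return "\n".join(sections)


# First line of the aspirin summary shown earlier in the article.
query_summary = "MW: 180.2, LogP: 1.31, TPSA: 63.6, HBD: 1, HBA: 3"

# Placeholders: in a real pipeline these would be mol.review() for each hit.
neighbor_summaries = ["<review text for hit 1>", "<review text for hit 2>"]

prompt = build_rag_prompt(
    query_summary,
    neighbor_summaries,
    "Evaluate the oral absorption potential of this compound.",
)
print(prompt)
```

Keeping the query molecule and the retrieved compounds under separate labels helps the model tell measured context about the target apart from analog evidence.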

4. MCP server: calling chemistry tools from an LLM

Motivation

The arXiv paper "ChemToolAgent" shows that for specialized chemistry tasks like reaction prediction and compound screening, calling external tools outperforms having a general-purpose LLM handle everything directly. "ChemMCP" is the MCP-compatible toolkit released alongside that work.

The design principle: don't make the LLM do chemistry calculations. Let the LLM judge and explain; hand the math to deterministic tools.

Implementation

chematic ships a built-in MCP (Model Context Protocol) server. Any MCP client — Claude Desktop, Cursor, etc. — can call chematic's chemistry tools directly from an agent.

{
  "mcpServers": {
    "chematic": {
      "command": "chematic",
      "args": ["mcp"]
    }
  }
}

Representative tools:

tool	description
`calc_properties`	MW, LogP, TPSA, HBD/HBA, etc.
`smarts_match`	SMARTS substructure matching
`pains_check`	PAINS alert detection
`generate_3d`	3D coordinate generation (MMFF94 force field)
`tanimoto`	Tanimoto similarity
`admet_profile`	ADMET (absorption, distribution, metabolism, excretion, toxicity) prediction

The README lists all 15 tools.
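Under the hood, an MCP client invokes a tool by sending a JSON-RPC 2.0 request with the method `tools/call`, and the server answers with a list of content items. The sketch below builds such a request for `calc_properties` and reads the text out of a response. The argument name `smiles` is an assumption (check the tool's schema), and the response shown is a hand-written mock that echoes values from the aspirin summary earlier in this article, not captured server output.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def extract_text(response):
    """Join the text items of a tools/call result, raising if the tool reported an error."""
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool call failed")
    return "\n".join(
        item["text"] for item in result["content"] if item.get("type") == "text"
    )


# "smiles" as the argument name is an assumption, not confirmed by the article.
request = make_tool_call(1, "calc_properties", {"smiles": "CC(=O)Oc1ccccc1C(=O)O"})
print(json.dumps(request, indent=2))

# Hand-written mock of a server reply, for illustration only.
mock_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "MW: 180.2, LogP: 1.31"}],
        "isError": False,
    },
}
print(extract_text(mock_response))
```

Clients such as Claude Desktop generate and send these messages for you; the sketch only shows what crosses the wire when the model decides to call a tool.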

In practice

With the config above, you can ask Claude Desktop chemistry questions in plain English:

User: Give me the ADMET profile for aspirin (CC(=O)Oc1ccccc1C(=O)O)

Claude: [calling chematic:admet_profile...]
        Absorption (A): LogP 1.31, TPSA 63.6 — within the range for oral absorption.
        Distribution (D): Moderate protein binding predicted.
        Metabolism (M): Likely metabolized by CYP2C9.
        Excretion (E): Renal excretion predicted as primary route.
        Toxicity (T): No PAINS alerts. No Brenk alerts.

The LLM doesn't compute anything itself: it calls admet_profile, receives a deterministic result, and uses that to generate the explanation. The model no longer has to guess property values, which greatly reduces the risk of hallucinated numbers, although it can still misquote or misinterpret what the tool returned.

Summary

Problem	Implementation
SMILES is too implicit for LLMs	ChemicalJSON (`mol.to_cjson`)
Need to pass molecule context as a bundle	`describe` / `review` / `report`
Need similar molecules for RAG context	ECFP4, Tanimoto, LSH
Don't want LLMs doing chemistry math	MCP server (15 tools)