<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kent-tokyo</title>
    <description>The latest articles on DEV Community by kent-tokyo (@kent-tokyo).</description>
    <link>https://dev.to/kent-tokyo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3936287%2Fbb5ec013-43fb-485c-b099-db72395640b5.png</url>
      <title>DEV Community: kent-tokyo</title>
      <link>https://dev.to/kent-tokyo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kent-tokyo"/>
    <language>en</language>
    <item>
      <title>How you pass molecules to an LLM matters: features I built into my Rust cheminformatics library after reading recent arXiv papers</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Sat, 27 Jun 2026 04:52:25 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/how-you-pass-molecules-to-an-llm-matters-features-i-built-into-my-rust-cheminformatics-library-56l9</link>
      <guid>https://dev.to/kent-tokyo/how-you-pass-molecules-to-an-llm-matters-features-i-built-into-my-rust-cheminformatics-library-56l9</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Simply passing a SMILES string into an LLM prompt is not enough to make the model reason correctly about molecular structure.&lt;/p&gt;

&lt;p&gt;Take aspirin: its SMILES is &lt;code&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/code&gt;. A chemist can read it at a glance, but an LLM has to work out which atom is bonded to which from a flat string, which puts it at a disadvantage on structure-understanding tasks.&lt;/p&gt;

&lt;p&gt;Reading recent arXiv papers, three directions emerge for addressing this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better molecular representations&lt;/strong&gt;: use a format more explicit than SMILES&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Richer context&lt;/strong&gt;: pass property data and similar molecules alongside the structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Role separation&lt;/strong&gt;: let LLMs judge and explain; hand calculations to deterministic tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I've been building &lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;chematic&lt;/a&gt;, a cheminformatics library written in pure Rust with Python, WASM, and MCP support. It handles molecule parsing, property calculation, similarity search, 3D generation, and more. This article covers the features I added with these three directions in mind.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. ChemicalJSON: explicit graph representation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;The arXiv paper &lt;a href="https://arxiv.org/abs/2605.01822" rel="noopener noreferrer"&gt;"Molecular Representations for Large Language Models"&lt;/a&gt; (May 2026) asks what a "good" molecular representation looks like for an LLM.&lt;/p&gt;

&lt;p&gt;It introduces MolJSON and benchmarks it against five common formats, including SMILES, IUPAC names, and InChI. The results show that &lt;strong&gt;explicit graph representations&lt;/strong&gt; outperform compressed string formats on structure-understanding tasks. On a shortest-path reasoning benchmark with GPT-5, MolJSON reached 98.5% accuracy, compared with 92.2% for SMILES and 82.7% for IUPAC names.&lt;/p&gt;

&lt;p&gt;SMILES compresses molecular structure into a string, so an LLM must implicitly parse "which atom bonds to which." An explicit JSON with an atom list and bond list makes that structure directly readable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;

&lt;span class="n"&gt;mol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;c1ccccc1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# benzene
&lt;/span&gt;
&lt;span class="c1"&gt;# convert to explicit graph JSON
&lt;/span&gt;&lt;span class="n"&gt;cjson&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_cjson&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# reconstruct from JSON
&lt;/span&gt;&lt;span class="n"&gt;mol2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_cjson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cjson&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"atoms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"element"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"C"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"bonds"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"begin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"begin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"begin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"begin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"begin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"begin"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"order"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aromatic"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The fact that &lt;code&gt;c1ccccc1&lt;/code&gt; represents six carbons in a ring is implicit in the string but explicit in the JSON. LLMs and AI agents can reference the structure directly in this form.&lt;/p&gt;

&lt;p&gt;chematic implements ChemicalJSON rather than MolJSON itself, but the intent is the same.&lt;/p&gt;
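&lt;p&gt;To make "directly readable" concrete, here is a minimal sketch of the shortest-path task from the benchmark, run on the atoms/bonds layout shown above. It uses only the Python standard library and works on plain JSON text, since this article does not state whether &lt;code&gt;to_cjson()&lt;/code&gt; returns a string or a dict. The helper names &lt;code&gt;build_adjacency&lt;/code&gt; and &lt;code&gt;shortest_path_length&lt;/code&gt; are hypothetical and not part of chematic.&lt;/p&gt;

```python
import json
from collections import deque

# Benzene in the atoms/bonds layout shown above, built programmatically.
CJSON = json.dumps({
    "atoms": [{"index": i, "element": "C", "aromatic": True} for i in range(6)],
    "bonds": [{"begin": i, "end": (i + 1) % 6, "order": "aromatic"} for i in range(6)],
})


def build_adjacency(cjson_text):
    """Return {atom index: set of neighbor indices} from the JSON text."""
    data = json.loads(cjson_text)
    adjacency = {atom["index"]: set() for atom in data["atoms"]}
    for bond in data["bonds"]:
        adjacency[bond["begin"]].add(bond["end"])
        adjacency[bond["end"]].add(bond["begin"])
    return adjacency


def shortest_path_length(adjacency, start, goal):
    """Breadth-first search; returns the number of bonds, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        atom, distance = queue.popleft()
        if atom == goal:
            return distance
        for neighbor in adjacency[atom]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, distance + 1))
    return None


adjacency = build_adjacency(CJSON)
print(shortest_path_length(adjacency, 0, 3))  # opposite ring atoms: 3
```

&lt;p&gt;The bond list maps one-to-one onto graph edges, so no ring-closure digits or branch parentheses need to be decoded before the search can start.&lt;/p&gt;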

&lt;h2&gt;
  
  
  2. Molecule context: describe / review / report
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;The arXiv paper &lt;a href="https://arxiv.org/abs/2606.05693" rel="noopener noreferrer"&gt;"MolE-RAG"&lt;/a&gt; proposes a way to improve LLM-based molecular property prediction. Instead of passing only a SMILES string, it supplies property descriptors, structural alerts, and structurally similar known compounds as &lt;strong&gt;inference-time context&lt;/strong&gt;, applying the RAG (Retrieval-Augmented Generation) idea to molecular reasoning. The paper reports an improvement of up to 28 ROC-AUC points over a SMILES-only baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;mol.review()&lt;/code&gt; returns a text summary of a molecule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;

&lt;span class="n"&gt;mol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# aspirin
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;span class="c1"&gt;# MW: 180.2, LogP: 1.31, TPSA: 63.6, HBD: 1, HBA: 3
# Rotatable bonds: 3, Aromatic rings: 1
# Drug-likeness: Lipinski pass
# Alerts: none (PAINS, Brenk)
# QED: 0.56
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Field glossary:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;MW&lt;/strong&gt; (molecular weight) / &lt;strong&gt;LogP&lt;/strong&gt; (lipophilicity) / &lt;strong&gt;TPSA&lt;/strong&gt; (topological polar surface area) / &lt;strong&gt;HBD/HBA&lt;/strong&gt; (hydrogen bond donors/acceptors): physicochemical properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lipinski pass&lt;/strong&gt;: whether the molecule meets Lipinski's Rule of Five for oral bioavailability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PAINS / Brenk&lt;/strong&gt;: structural alert filters that flag substructures known to cause false positives in drug discovery screens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QED&lt;/strong&gt; (Quantitative Estimate of Drug-likeness): drug-likeness score from 0 to 1&lt;/li&gt;
&lt;/ul&gt;
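&lt;p&gt;As an illustration of what "Lipinski pass" means, here is a minimal sketch of the Rule of Five applied to precomputed descriptors. The thresholds are the standard ones; treating up to one violation as a pass is a common convention that I am assuming here, and chematic's own criteria may differ. The function names are hypothetical.&lt;/p&gt;

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count Rule of Five violations from four precomputed descriptors."""
    checks = [
        mw > 500,   # molecular weight above 500 Da
        logp > 5,   # LogP above 5
        hbd > 5,    # more than 5 hydrogen bond donors
        hba > 10,   # more than 10 hydrogen bond acceptors
    ]
    return sum(checks)


def lipinski_pass(mw, logp, hbd, hba, max_violations=1):
    """Assumed convention: at most max_violations violations counts as a pass."""
    return not lipinski_violations(mw, logp, hbd, hba) > max_violations


# Aspirin values taken from the review() output above
print(lipinski_violations(180.2, 1.31, 1, 3))  # 0
print(lipinski_pass(180.2, 1.31, 1, 3))        # True
```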

&lt;p&gt;&lt;code&gt;mol.describe()&lt;/code&gt; returns a more detailed text description. &lt;code&gt;chematic.molecule_report(mol)&lt;/code&gt; generates an HTML report.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding in an LLM prompt
&lt;/h3&gt;

&lt;p&gt;Including this summary in the prompt lets the LLM reference real property values while generating a response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;

&lt;span class="n"&gt;mol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
Answer the question below about the following molecule.

Molecule info:
&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Question: Evaluate the oral absorption potential of this compound.
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compared to passing only SMILES, the LLM can now reason explicitly: "LogP 1.31 suggests moderate lipophilicity" and "Lipinski pass indicates likely oral absorption." This is the library-side counterpart to MolE-RAG's "what to pass" question.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. ECFP4 / Tanimoto / LSH: similarity search
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;To implement MolE-RAG's idea of "add structurally similar molecules to context," you need fingerprint computation and similarity search.&lt;/p&gt;

&lt;p&gt;A molecular fingerprint encodes a molecule's substructures as a bit vector. ECFP4 is a widely used choice: it hashes the circular neighborhood within two bonds of each atom (a diameter of 4, hence the name) into the vector. Molecules with similar structures produce similar fingerprints.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;

&lt;span class="n"&gt;aspirin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ibuprofen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CC(C)Cc1ccc(cc1)C(C)C(=O)O&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# compute fingerprints and similarity
&lt;/span&gt;&lt;span class="n"&gt;fp1&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;aspirin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ecfp4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;fp2&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ibuprofen&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ecfp4&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanimoto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fp1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fp2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Tanimoto coefficient: 0.0 (unrelated) to 1.0 (identical)
&lt;/span&gt;
&lt;span class="c1"&gt;# pairwise Tanimoto matrix for mols, a list of parsed molecules (parallelized in Rust)
&lt;/span&gt;&lt;span class="n"&gt;matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bulk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tanimoto_matrix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# approximate nearest-neighbor search with LSH (Locality Sensitive Hashing)
&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SimilarityIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mols&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;similar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;aspirin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 10 molecules most similar to aspirin
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Tanimoto coefficient, the number of bits set in both fingerprints divided by the number set in either, is the most widely used similarity metric in drug discovery. LSH enables fast approximate nearest-neighbor search over large compound libraries.&lt;/p&gt;
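&lt;p&gt;For reference, the Tanimoto coefficient itself fits in a few lines. The sketch below treats a fingerprint as a set of on-bit positions and does an exact top-k scan, which is the baseline that LSH approximates. The bit positions are made up for illustration and are not real ECFP4 output; &lt;code&gt;tanimoto&lt;/code&gt; and &lt;code&gt;top_k&lt;/code&gt; here are standalone helpers, not chematic's implementation.&lt;/p&gt;

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity of two sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = a.union(b)
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a.intersection(b)) / len(union)


def top_k(query_bits, library, k):
    """Exact nearest neighbors: score every entry, keep the k best."""
    scored = [(tanimoto(query_bits, bits), name) for name, bits in library.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy fingerprints (made-up bit positions, not real ECFP4 output)
library = {
    "mol_a": {1, 4, 9, 16},
    "mol_b": {1, 4, 9, 25},
    "mol_c": {2, 3, 5, 7},
}
query = {1, 4, 9, 16}
print(top_k(query, library, k=2))
```

&lt;p&gt;The exact scan costs one comparison per library entry for every query, which is why an approximate index becomes attractive as the library grows.&lt;/p&gt;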

&lt;h3&gt;
  
  
  Using it in a RAG pipeline
&lt;/h3&gt;

&lt;p&gt;Similarity search can serve as the retrieval step in a RAG pipeline, adding known compounds similar to the query molecule as context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
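&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;

&lt;span class="c1"&gt;# build an index from a known compound library (known_compounds is your own list of SMILES strings)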
&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;known_compounds&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;SimilarityIndex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# retrieve similar compounds and build context
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;similar&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similar compound &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mol&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;similar&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;ECFP4 computation, Tanimoto matrices, and LSH indexing all run on the Rust side, so calling from Python stays practical even for libraries of several thousand molecules.&lt;/p&gt;
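&lt;p&gt;To show the idea behind LSH for Tanimoto similarity, here is a rough pure-Python sketch of one common scheme, MinHash with banding. I am not claiming this is how chematic's &lt;code&gt;SimilarityIndex&lt;/code&gt; works internally; the class name &lt;code&gt;MinHashLSH&lt;/code&gt;, the parameters, and the toy bit sets are all assumptions made for illustration.&lt;/p&gt;

```python
import hashlib
from collections import defaultdict


def _hash(seed, bit):
    """Deterministic 64-bit hash of one on-bit position under one seed."""
    digest = hashlib.blake2b(f"{seed}:{bit}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(bits, num_hashes=32):
    """One minimum per seeded hash function; expects a non-empty set of on-bits."""
    return tuple(min(_hash(seed, bit) for bit in bits) for seed in range(num_hashes))


class MinHashLSH:
    """Banded MinHash index: entries that share any band become candidates."""

    def __init__(self, num_hashes=32, bands=8):
        # num_hashes is assumed to be divisible by bands
        self.num_hashes = num_hashes
        self.rows = num_hashes // bands
        self.buckets = defaultdict(set)

    def _band_keys(self, signature):
        for start in range(0, self.num_hashes, self.rows):
            yield (start, signature[start:start + self.rows])

    def add(self, name, bits):
        for key in self._band_keys(minhash_signature(bits, self.num_hashes)):
            self.buckets[key].add(name)

    def candidates(self, bits):
        found = set()
        for key in self._band_keys(minhash_signature(bits, self.num_hashes)):
            found.update(self.buckets.get(key, ()))
        return found


index = MinHashLSH()
index.add("mol_a", {1, 4, 9, 16})   # made-up bit positions
index.add("mol_c", {2, 3, 5, 7})
print(index.candidates({1, 4, 9, 16}))  # includes "mol_a": identical signature
```

&lt;p&gt;Candidates returned this way would normally be re-scored with the exact Tanimoto coefficient, so the index only has to narrow the search, not rank it.&lt;/p&gt;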

&lt;h2&gt;
  
  
  4. MCP server: calling chemistry tools from an LLM
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Motivation
&lt;/h3&gt;

&lt;p&gt;The arXiv paper &lt;a href="https://arxiv.org/abs/2411.07228" rel="noopener noreferrer"&gt;"ChemToolAgent"&lt;/a&gt; shows that for specialized chemistry tasks like reaction prediction and compound screening, calling external tools outperforms having a general-purpose LLM handle everything directly. "ChemMCP" is the MCP-compatible toolkit released alongside that work.&lt;/p&gt;

&lt;p&gt;The design principle: don't make the LLM do chemistry calculations. Let the LLM judge and explain; hand the math to deterministic tools.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;chematic ships a built-in MCP (Model Context Protocol) server. Any MCP client — Claude Desktop, Cursor, etc. — can call chematic's chemistry tools directly from an agent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"chematic"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chematic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"mcp"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Representative tools:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;tool&lt;/th&gt;
&lt;th&gt;description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;calc_properties&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;MW, LogP, TPSA, HBD/HBA, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;smarts_match&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;SMARTS substructure matching&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;pains_check&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;PAINS alert detection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;generate_3d&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3D coordinate generation (MMFF94 force field)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;tanimoto&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Tanimoto similarity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;admet_profile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ADMET (absorption, distribution, metabolism, excretion, toxicity) prediction&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The full list of 15 tools is in the &lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  In practice
&lt;/h3&gt;

&lt;p&gt;With the config above, you can ask Claude Desktop chemistry questions in plain English:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Give me the ADMET profile for aspirin (CC(=O)Oc1ccccc1C(=O)O)

Claude: [calling chematic:admet_profile...]
        Absorption (A): LogP 1.31, TPSA 63.6 — within the range for oral absorption.
        Distribution (D): Moderate protein binding predicted.
        Metabolism (M): Likely metabolized by CYP2C9.
        Excretion (E): Renal excretion predicted as primary route.
        Toxicity (T): No PAINS alerts. No Brenk alerts.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM doesn't compute anything itself: it calls &lt;code&gt;admet_profile&lt;/code&gt;, receives a deterministic result, and uses that to generate the explanation. Because the numbers come from the tool rather than from the model, the risk of the LLM making up property values drops sharply.&lt;/p&gt;
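&lt;p&gt;Under the hood, an MCP client turns that tool call into a JSON-RPC 2.0 message with the &lt;code&gt;tools/call&lt;/code&gt; method. The sketch below builds such a request in Python. The tool name comes from the table above, but the argument name &lt;code&gt;smiles&lt;/code&gt; is my assumption for illustration; check the tool's declared input schema for the real parameter names.&lt;/p&gt;

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build an MCP tools/call request as a JSON-RPC 2.0 message."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


request = build_tool_call(
    1,
    "admet_profile",
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O"},  # "smiles" is an assumed argument name
)
print(json.dumps(request, indent=2))
```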

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SMILES is too implicit for LLMs&lt;/td&gt;
&lt;td&gt;ChemicalJSON (&lt;code&gt;mol.to_cjson&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need to pass molecule context as a bundle&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;describe&lt;/code&gt; / &lt;code&gt;review&lt;/code&gt; / &lt;code&gt;report&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need similar molecules for RAG context&lt;/td&gt;
&lt;td&gt;ECFP4, Tanimoto, LSH&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Don't want LLMs doing chemistry math&lt;/td&gt;
&lt;td&gt;MCP server (15 tools)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;chematic aims to be a foundation for this direction — pure Rust, with Python, WASM, and MCP interfaces.&lt;/p&gt;

&lt;p&gt;A separate article on 3D generation and force fields (MMFF94 / DREIDING) is coming.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;github.com/kent-tokyo/chematic&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2605.01822" rel="noopener noreferrer"&gt;Molecular Representations for Large Language Models&lt;/a&gt; (May 2026)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2606.05693" rel="noopener noreferrer"&gt;MolE-RAG: Molecular Structure-Enhanced Retrieval-Augmented Generation for Chemistry&lt;/a&gt; (June 2026)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2411.07228" rel="noopener noreferrer"&gt;ChemToolAgent: The Impact of Tools on Language Agents for Chemistry Problem Solving&lt;/a&gt; (November 2024; ChemMCP is the toolkit released with this paper)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rust</category>
      <category>chemistry</category>
      <category>cheminformatics</category>
      <category>llm</category>
    </item>
    <item>
      <title>Building a Computer-Aided Synthesis Planning Engine in Pure Rust</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Sun, 21 Jun 2026 09:34:20 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/building-a-computer-aided-synthesis-planning-engine-in-pure-rust-3jg4</link>
      <guid>https://dev.to/kent-tokyo/building-a-computer-aided-synthesis-planning-engine-in-pure-rust-3jg4</guid>
      <description>&lt;p&gt;I've been building &lt;code&gt;renkin&lt;/code&gt;, a retrosynthesis engine in Pure Rust. You give it a target molecule as a SMILES string and it tries to find synthesis routes back to commercially available starting materials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kent-tokyo/renkin" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/renkin&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retrosynthesis&lt;/strong&gt; is how organic chemists plan a synthesis: instead of asking "how do I make this?", you ask "what reaction could have produced this, and where do those precursors come from?" — working backwards from the target until you reach things you can actually buy. &lt;code&gt;renkin&lt;/code&gt; automates that search.&lt;/p&gt;

&lt;p&gt;SMILES is a text notation for molecular structure. Aspirin is &lt;code&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/code&gt;. That's what you pass in.&lt;/p&gt;

&lt;h2&gt;
  
  
  Search design
&lt;/h2&gt;

&lt;p&gt;The straightforward approach — try every applicable reaction rule at every intermediate — runs into combinatorial explosion fast. A single molecule can match hundreds of rules, they apply recursively to every precursor, and the space grows exponentially. I needed something smarter.&lt;/p&gt;

&lt;h3&gt;
  
  
  AND-OR tree
&lt;/h3&gt;

&lt;p&gt;Retrosynthesis search has a structure that doesn't fit standard graph search well. At any step, you can choose between reactions (either A or B works: an OR), but each reaction requires all its precursors simultaneously (an AND). Standard graph search treats every edge as an alternative, so it can't express that a reaction only counts as solved once all of its precursors are. &lt;code&gt;renkin&lt;/code&gt; models the space as an AND-OR tree and searches it accordingly.&lt;/p&gt;
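
&lt;p&gt;A minimal Python sketch of AND-OR cost accounting. The toy reaction table, the &lt;code&gt;route_cost&lt;/code&gt; name, and the unit cost per step are assumptions made for illustration, not &lt;code&gt;renkin&lt;/code&gt;'s actual internals:&lt;/p&gt;

```python
import math

# Hypothetical toy data: each molecule maps to its candidate reactions (OR),
# and each reaction is a tuple of precursors that are all required (AND).
REACTIONS = {
    "target": [("A", "B"), ("C",)],
    "A": [("D",)],
    "C": [("E", "F")],
}
PURCHASABLE = {"B", "D", "E"}


def route_cost(mol, depth=3):
    """Cheapest number of reaction steps needed to make `mol`, or inf."""
    if mol in PURCHASABLE:
        return 0
    if depth == 0 or mol not in REACTIONS:
        return math.inf
    best = math.inf
    for precursors in REACTIONS[mol]:
        # AND node: every precursor must be solved, so costs add up.
        cost = 1 + sum(route_cost(p, depth - 1) for p in precursors)
        # OR node: only the cheapest alternative reaction matters.
        best = min(best, cost)
    return best


print(route_cost("target"))
```

&lt;p&gt;The route through &lt;code&gt;C&lt;/code&gt; is rejected because one of its precursors (&lt;code&gt;F&lt;/code&gt;) can't be made or bought, even though the other one can. That is the AND constraint at work.&lt;/p&gt;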

&lt;h3&gt;
  
  
  A* with SA Score
&lt;/h3&gt;

&lt;p&gt;For the A* heuristic, I use the &lt;strong&gt;SA Score (Synthetic Accessibility Score)&lt;/strong&gt; — a 1–10 number for how synthetically accessible a molecule is, where lower is easier. The idea is that lower SA Score intermediates are more likely to show up in building block catalogs, so steering the search in that direction tends to find better routes. It worked reasonably well in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beam search
&lt;/h3&gt;

&lt;p&gt;For large molecules, even the AND-OR + A* combination can get out of hand. Beam search caps the candidates per step at N, which makes the computation predictable at the cost of occasionally pruning a route that would have worked.&lt;/p&gt;
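
&lt;p&gt;The pruning step can be sketched in a few lines of Python. The &lt;code&gt;prune_to_beam&lt;/code&gt; helper and the scores below are hypothetical; how &lt;code&gt;renkin&lt;/code&gt; actually scores and stores candidates may differ:&lt;/p&gt;

```python
import heapq


def prune_to_beam(candidates, beam_width):
    """Keep only the `beam_width` lowest-scoring candidates for the next step."""
    # Each candidate is a (score, smiles) pair; lower score means more promising.
    return heapq.nsmallest(beam_width, candidates)


# Hypothetical scores for illustration only, not real SA Score values.
frontier = [(3.2, "CCO"), (1.8, "CC(=O)O"), (5.1, "c1ccccc1Br"), (2.4, "CN")]
print(prune_to_beam(frontier, 2))
```

&lt;p&gt;Everything outside the beam is dropped for good, which is why a wider beam can solve molecules that a narrow one misses.&lt;/p&gt;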

&lt;h2&gt;
  
  
  Reaction rules
&lt;/h2&gt;

&lt;p&gt;Rules are written in SMARTS (a pattern language for chemical structures). The current set has 314 rules:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;31&lt;/strong&gt;: hand-crafted rules for the most common reaction types — amide bond formation, esterification, Suzuki coupling, and similar&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;283&lt;/strong&gt;: automatically extracted from the USPTO reaction database using rdchiral&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hand-crafted ones tend to be cleaner but don't cover much ground. The auto-extracted ones add coverage but come with noise. Template frequency weighting — giving higher priority to rules that appear more often in USPTO — turned out to be the biggest single factor in accuracy.&lt;/p&gt;
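
&lt;p&gt;As a rough illustration of frequency weighting, here is a Python sketch that ranks templates by how often they occur. The template names and counts are invented, and using relative frequency as the weight is an assumption; the exact weighting formula &lt;code&gt;renkin&lt;/code&gt; uses isn't spelled out here:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical template identifiers, one entry per reaction in which the
# template was observed; the names are invented for this sketch.
observed = ["amide", "suzuki", "amide", "ester", "amide", "suzuki"]

counts = Counter(observed)
total = sum(counts.values())

# Weight = relative frequency; rules are then tried in descending weight.
weights = {name: n / total for name, n in counts.items()}
ranked = sorted(weights, key=lambda name: (-weights[name], name))

print(ranked)
```

&lt;p&gt;Sorting on the name as a tie-breaker keeps the order deterministic when two templates have the same frequency.&lt;/p&gt;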

&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;USPTO-50k (4,907-molecule test set) is the standard evaluation for retrosynthesis tools. Here's how the numbers changed as I added each piece:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;Solved&lt;/th&gt;
&lt;th&gt;Rate&lt;/th&gt;
&lt;th&gt;Rules&lt;/th&gt;
&lt;th&gt;Depth&lt;/th&gt;
&lt;th&gt;Beam&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v0.1.0 initial (hand-crafted only)&lt;/td&gt;
&lt;td&gt;366/4907&lt;/td&gt;
&lt;td&gt;7.5%&lt;/td&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ auto templates (top-300)&lt;/td&gt;
&lt;td&gt;1363/4907&lt;/td&gt;
&lt;td&gt;27.8%&lt;/td&gt;
&lt;td&gt;222&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ depth=5, top-500 templates&lt;/td&gt;
&lt;td&gt;2315/4907&lt;/td&gt;
&lt;td&gt;47.2%&lt;/td&gt;
&lt;td&gt;314&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ beam=100&lt;/td&gt;
&lt;td&gt;2688/4907&lt;/td&gt;
&lt;td&gt;54.8%&lt;/td&gt;
&lt;td&gt;314&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;+ template frequency weighting&lt;/td&gt;
&lt;td&gt;~3484/4907&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~71%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;314&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The ~71% in the last row was measured on a 100-molecule sample, not the full 4,907, so the ~3484 count is an extrapolation. Take it as a directional figure.&lt;/p&gt;
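
&lt;p&gt;To get a feel for how rough a 100-molecule figure is, the arithmetic can be worked through in Python. This assumes the 100 molecules were a random sample of the test set and uses the normal approximation for a binomial proportion, both of which are assumptions added here:&lt;/p&gt;

```python
import math

sample_size = 100      # molecules actually evaluated
solved_rate = 0.71     # rate observed on that sample
full_set = 4907        # size of the full test set

# Solved count projected onto the full test set.
extrapolated = round(solved_rate * full_set)

# Normal-approximation 95% interval for a binomial proportion.
std_err = math.sqrt(solved_rate * (1 - solved_rate) / sample_size)
margin = 1.96 * std_err

print(extrapolated)
print(f"{solved_rate - margin:.2f} to {solved_rate + margin:.2f}")
```

&lt;p&gt;Under those assumptions the sampling margin is roughly nine percentage points either way, so the full-set rate could plausibly sit anywhere from the low 60s to around 80%.&lt;/p&gt;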

&lt;p&gt;Comparison with other tools (same train/test split):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;th&gt;USPTO-50k&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;renkin&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;A* + AND-OR tree&lt;/td&gt;
&lt;td&gt;~71%†&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GLG&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;58.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LocalRetro&lt;/td&gt;
&lt;td&gt;Neural network&lt;/td&gt;
&lt;td&gt;53.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AiZynthFinder&lt;/td&gt;
&lt;td&gt;MCTS&lt;/td&gt;
&lt;td&gt;45–53%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retro*&lt;/td&gt;
&lt;td&gt;AND-OR tree search&lt;/td&gt;
&lt;td&gt;44.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASKCOS&lt;/td&gt;
&lt;td&gt;MCTS&lt;/td&gt;
&lt;td&gt;41%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;† renkin's figure is from a 100-molecule sample; other tools used the full 4,907. This comparison still needs more work — I haven't verified whether the number holds at full scale.&lt;/p&gt;

&lt;p&gt;The jump from template frequency weighting alone was larger than I expected. It's the thing I'd add first if starting over.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Pure Rust
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;renkin&lt;/code&gt; is built on &lt;code&gt;chematic&lt;/code&gt;, a Pure Rust cheminformatics library I wrote earlier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/chematic&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That means SMARTS matching, molecular graph operations, and SA Score calculation are all in safe Rust, no FFI. &lt;code&gt;cargo build&lt;/code&gt; is enough, and it compiles to WebAssembly (~500 KB). For parallel rule application, &lt;code&gt;renkin&lt;/code&gt; uses &lt;code&gt;rayon&lt;/code&gt; — including a WASM-compatible build that runs through Web Workers, though that path hasn't had as much testing yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Usage
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Python
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;renkin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;renkin&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;renkin&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_routes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# aspirin
&lt;/span&gt;    &lt;span class="n"&gt;depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_routes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;routes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;target&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; + &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;precursors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;  [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rule&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  CLI (Rust)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;renkin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;renkin &lt;span class="nt"&gt;--target&lt;/span&gt; &lt;span class="s2"&gt;"CC(=O)Oc1ccccc1C(=O)O"&lt;/span&gt; &lt;span class="nt"&gt;--depth&lt;/span&gt; 5 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--templates&lt;/span&gt; data/templates_extracted.smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  JavaScript / Node.js
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;renkin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;find_routes&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;renkin&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;find_routes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What's left
&lt;/h2&gt;

&lt;p&gt;314 rules aren't enough for complex molecules like natural products — success rates drop there. I want to try pulling more templates from sources beyond USPTO.&lt;/p&gt;

&lt;p&gt;Scoring routes by step count, yield, and cost (rather than just solved/not-solved) is also on the list. And a browser UI for stepping through the AND-OR tree is in progress.&lt;/p&gt;




&lt;p&gt;Retrosynthesis engine "renkin":&lt;br&gt;
&lt;a href="https://github.com/kent-tokyo/renkin" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/renkin&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The cheminformatics library underneath, "chematic":&lt;br&gt;
&lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/chematic&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>cheminformatics</category>
      <category>webassembly</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building chematic: Why I Wrote a Pure-Rust Cheminformatics Library from Scratch</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Fri, 19 Jun 2026 09:34:51 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/building-chematic-why-i-wrote-a-pure-rust-cheminformatics-library-from-scratch-5f8h</link>
      <guid>https://dev.to/kent-tokyo/building-chematic-why-i-wrote-a-pure-rust-cheminformatics-library-from-scratch-5f8h</guid>
      <description>&lt;p&gt;I'm building &lt;code&gt;chematic&lt;/code&gt;, a cheminformatics library in pure Rust, from scratch. Here's why I started, and what kept tripping me up along the way.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it started
&lt;/h2&gt;

&lt;p&gt;"I just want to run RDKit in the browser" — that was it.&lt;/p&gt;

&lt;p&gt;RDKit.js exists. But the WASM binary is over 30 MB, and building it requires cmake, clang, and the Emscripten SDK. CI kept breaking, Docker images bloated, and every deploy meant setting up the build environment again. Too much overhead for what I wanted to do.&lt;/p&gt;

&lt;p&gt;I tried OCL.js (OpenChemLib) too, but it's Java code transpiled via GWT, so the API feels Java-shaped and TypeScript types don't come out cleanly.&lt;/p&gt;

&lt;p&gt;That's when I thought: what if I just wrote it in pure Rust? At the time, I figured a few weeks would be enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I imposed the pure-Rust constraint
&lt;/h2&gt;

&lt;p&gt;I didn't set "no FFI" as a rule from the start. But as I kept writing, I could see that allowing one exception would cascade.&lt;/p&gt;

&lt;p&gt;Take InChI. The official IUPAC implementation is in C, and reimplementing it correctly in pure Rust isn't realistic. The moment I said "I'll use FFI just for this," cmake and Emscripten dependencies would come back. Better to draw the line early. No FFI, no &lt;code&gt;unsafe&lt;/code&gt;, no random number generation. Those three constraints went in first.&lt;/p&gt;

&lt;p&gt;That decision about InChI would come back to bite me later, through an issue filed by an outside user.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting to 15 crates
&lt;/h2&gt;

&lt;p&gt;I started with a single crate. As files multiplied and compile times grew, I split out &lt;code&gt;chematic-core&lt;/code&gt; and &lt;code&gt;chematic-smiles&lt;/code&gt; first. Every time a new feature landed, I carved it out, and now there are 15 crates.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chematic-core       → atom/bond/molecule primitives, kekulization
chematic-smiles     → OpenSMILES parser, canonical SMILES writer
chematic-perception → ring perception (SSSR), aromaticity
chematic-mol        → MOL/SDF file I/O
chematic-depict     → 2D SVG rendering (CPK colors, SMARTS highlighting)
chematic-chem       → 70+ descriptors, pKa prediction, ADMET profiling
chematic-fp         → 6 fingerprint types (ECFP, MACCS, MAP4, etc.)
chematic-smarts     → SMARTS parser, substructure search (VF2)
chematic-ff         → force field implementations (UFF, DREIDING, MMFF94)
chematic-3d         → 3D coordinate generation (ETKDG), conformer handling
chematic-rxn        → reaction SMILES/SMIRKS parser
chematic-wasm       → JavaScript/TypeScript WASM bindings
chematic-iupac      → IUPAC name generation (25+ compound classes)
chematic-mcp        → MCP server for AI agents (14 tools)
chematic            → umbrella crate integrating everything
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The trickiest part was the dependency graph. &lt;code&gt;chematic-perception&lt;/code&gt; (ring perception) depends on &lt;code&gt;chematic-core&lt;/code&gt;, but the kekulization code inside &lt;code&gt;core&lt;/code&gt; needs ring perception results. To break the cycle, I put the kekulization interface in &lt;code&gt;core&lt;/code&gt; and the implementation in &lt;code&gt;perception&lt;/code&gt;. Not the cleanest design, but it works for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting stuck on the SMILES parser
&lt;/h2&gt;

&lt;p&gt;"SMILES is a simple notation, so parsing it should be simple" — wrong. The OpenSMILES spec has enough ambiguity that for edge cases I ended up looking at RDKit's behavior and using that as the reference.&lt;/p&gt;

&lt;p&gt;The first wall was branch handling. Writing a recursive-descent parser for deeply nested structures like &lt;code&gt;C(CC(N)CC)(=O)O&lt;/code&gt; made stack management awkward. I rewrote it as an iterative implementation with an explicit stack.&lt;/p&gt;

&lt;p&gt;Implicit hydrogens were also quietly painful. SMILES usually omits hydrogens, so the parser has to infer them from valence. That calculation is more complex than it looks (atom type, charge, and valence model all interact), and when I ran ChEMBL molecules through it, mismatches kept showing up and I had to fix them multiple times.&lt;/p&gt;
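
&lt;p&gt;For the simplest case, the rule can be sketched in Python as follows. This covers only neutral, unbracketed, non-aromatic atoms from the OpenSMILES organic subset; the &lt;code&gt;implicit_h_count&lt;/code&gt; name is hypothetical, and charges and aromatic atoms (where most of the pain lives) are deliberately left out:&lt;/p&gt;

```python
# Normal valences for the OpenSMILES "organic subset" (unbracketed atoms).
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,),
    "P": (3, 5), "S": (2, 4, 6),
    "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}


def implicit_h_count(element, bond_order_sum):
    """Implicit hydrogens for a neutral, unbracketed, non-aromatic atom."""
    for valence in NORMAL_VALENCES[element]:
        # Use the lowest normal valence that accommodates the explicit bonds.
        if valence >= bond_order_sum:
            return valence - bond_order_sum
    # Bonds already exceed every normal valence: no implicit hydrogens.
    return 0


print(implicit_h_count("C", 1))  # a terminal carbon, as in CC
print(implicit_h_count("N", 4))  # falls through to the valence-5 model
print(implicit_h_count("S", 7))  # exceeds every listed valence
```

&lt;p&gt;Even this small table shows why mismatches creep in: nitrogen, phosphorus, and sulfur each have more than one valid valence, so the hydrogen count depends on which model applies.&lt;/p&gt;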

&lt;h2&gt;
  
  
  Getting stuck on kekulization
&lt;/h2&gt;

&lt;p&gt;Kekulization is the conversion of aromatic SMILES (lowercase atoms like &lt;code&gt;c1ccccc1&lt;/code&gt;) into explicit single/double bond form (&lt;code&gt;C1=CC=CC=C1&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;It's a maximum matching problem on the aromatic subgraph (which isn't bipartite in general, since odd-membered aromatic rings exist), and the algorithm itself is well-known. What I didn't anticipate was scale: for large fused ring systems like porphyrins and natural products, the matching search blows up exponentially. I only noticed this when I ran the full ChEMBL dataset and molecules over MW 5000 were timing out.&lt;/p&gt;

&lt;p&gt;Atoms with only two adjacent aromatic bonds have a unique solution for which bond gets the double bond. Processing those first and reducing the graph before running the matching fixed the timeouts for large molecules. That fix went into v0.4.6.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fingerprint determinism
&lt;/h2&gt;

&lt;p&gt;The thing that caught me in the ECFP (Morgan algorithm) implementation was atom traversal order.&lt;/p&gt;

&lt;p&gt;Rust's &lt;code&gt;HashMap&lt;/code&gt; randomizes its seed by default (HashDoS mitigation). So when traversing molecular graph atoms via a hash map, the order changes every run — same molecule, different bit vector. The symptom was "Tanimoto similarity varies between runs," and it took a while to track down.&lt;/p&gt;

&lt;p&gt;I switched to sorting all atoms by atomic number, degree, charge, and isotope mass before computing, and replaced &lt;code&gt;AHashMap&lt;/code&gt; with &lt;code&gt;IndexMap&lt;/code&gt; for deterministic ordering. Setting the no-random-numbers constraint upfront probably saved me from shipping a "works for now" implementation and hitting the reproducibility bug later.&lt;/p&gt;
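
&lt;p&gt;The sorting idea, sketched in Python. The &lt;code&gt;Atom&lt;/code&gt; record and &lt;code&gt;canonical_order&lt;/code&gt; are hypothetical stand-ins, and using the atom index as the final tie-breaker is an assumption made here so that the sort key is always unique:&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    # Hypothetical minimal atom record for this sketch.
    index: int
    atomic_number: int
    degree: int
    charge: int
    isotope: int


def canonical_order(atoms):
    """Sort atoms by their invariants so traversal never depends on hashing."""
    return sorted(
        atoms,
        key=lambda a: (a.atomic_number, a.degree, a.charge, a.isotope, a.index),
    )


atoms = [
    Atom(index=0, atomic_number=6, degree=1, charge=0, isotope=0),
    Atom(index=1, atomic_number=8, degree=1, charge=0, isotope=0),
    Atom(index=2, atomic_number=6, degree=3, charge=0, isotope=0),
]

# Any input order yields the same traversal order.
order_a = [a.index for a in canonical_order(atoms)]
order_b = [a.index for a in canonical_order(reversed(atoms))]
print(order_a, order_a == order_b)
```

&lt;p&gt;Once the traversal order is a pure function of the molecule, the fingerprint bits are too, and Tanimoto values stop drifting between runs.&lt;/p&gt;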

&lt;h2&gt;
  
  
  Fighting the borrow checker in CIP
&lt;/h2&gt;

&lt;p&gt;CIP (Cahn–Ingold–Prelog) rules determine R/S for stereocenters and E/Z for double bonds. Assigning priority to an atom requires the priorities of its neighbors, which requires their neighbors, and so on — a recursive dependency.&lt;/p&gt;

&lt;p&gt;Understanding the algorithm wasn't the hard part. Getting it past the borrow checker was. Holding the graph as &lt;code&gt;Rc&amp;lt;RefCell&amp;lt;Node&amp;gt;&amp;gt;&lt;/code&gt; caused other issues; I tried an Arena pattern too. Eventually I settled on a flat &lt;code&gt;Vec&amp;lt;Atom&amp;gt;&lt;/code&gt; with &lt;code&gt;AtomIdx&lt;/code&gt; (&lt;code&gt;usize&lt;/code&gt; newtype) for indirection. No need to hold &lt;code&gt;&amp;amp;mut&lt;/code&gt; and &lt;code&gt;&amp;amp;&lt;/code&gt; simultaneously during traversal, so the borrow checker was happy. Took a few days, but once this pattern clicked, I used it for other algorithms too.&lt;/p&gt;

&lt;h2&gt;
  
  
  ChEMBL full validation
&lt;/h2&gt;

&lt;p&gt;Mid-development, unit tests would pass completely while the parser crashed on specific ChEMBL molecules.&lt;/p&gt;

&lt;p&gt;The causes were roughly three kinds: SSSR count mismatches on rare fused ring systems, hydrogen valence miscalculations from non-standard SMILES, and the kekulization timeouts mentioned above. None of these were cases I would have thought to write unit tests for — they only surface when you run real data.&lt;/p&gt;

&lt;p&gt;Eventually I parsed all 2,897,819 molecules from ChEMBL 37 without a failure. In the RDKit compatibility benchmark on 5,000 molecules, HBA (H-bond acceptor) count agreement reached 99.98%, and aromatic ring count agreement hit 95.6% — the remaining gap comes from differing treatment of fused N-heterocycles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issue #11: dropping InChI
&lt;/h2&gt;

&lt;p&gt;On June 16, 2026, an external issue came in:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"100% errors on InChI generation vs IUPAC reference InChI"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Even benzene produced the wrong connectivity block. The root causes identified were: InChI's own canonical numbering algorithm (different from Morgan ordering) wasn't implemented, spurious stereochemistry layers were being added to molecules without stereocenters, tautomer and mobile-hydrogen normalization was missing, and the InChIKey hash conversion diverged from spec — not isolated bugs, but a design-level failure.&lt;/p&gt;

&lt;p&gt;InChI's spec is effectively defined by the IUPAC C library implementation. Reimplementing that correctly from scratch in pure Rust, without following the C library, isn't realistic. I'd expected this moment when I banned FFI, so I deleted &lt;code&gt;chematic-inchi&lt;/code&gt; and documented InChI/InChIKey as an unsupported limitation.&lt;/p&gt;

&lt;h2&gt;
  
  
  WASM binding design
&lt;/h2&gt;

&lt;p&gt;I use &lt;code&gt;wasm-bindgen&lt;/code&gt; to expose Rust functions to JavaScript. For complex return types, &lt;code&gt;wasm-bindgen&lt;/code&gt; can't pass Rust types directly, so I serialize to JSON via &lt;code&gt;serde_json&lt;/code&gt; and call &lt;code&gt;JSON.parse()&lt;/code&gt; on the JS side.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_descriptors_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`MW: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mw&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, TPSA: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tpsa&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, LogP: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logP&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The JSON round trip isn't suited for high-frequency calls, but chemistry calculations don't typically need that, so it hasn't been a bottleneck. TypeScript type definitions are auto-generated by &lt;code&gt;wasm-bindgen&lt;/code&gt;, so IDE autocomplete works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Rust actually helped
&lt;/h2&gt;

&lt;p&gt;Not all of it was painful.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;BondOrder&lt;/code&gt; is an enum, so calling "count double bonds" on an un-kekulized aromatic molecule is a compile error. In C++ that silently returns a wrong value.&lt;/p&gt;

&lt;p&gt;The 3D force field implementations (UFF, DREIDING, MMFF94) were also written without a single line of &lt;code&gt;unsafe&lt;/code&gt;. Translating the math into safe Rust was more straightforward than I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  Current state
&lt;/h2&gt;

&lt;p&gt;v0.4.6 (June 19, 2026) is the latest release. 1,991 tests pass across the workspace. The WASM binary is ~550 KB after &lt;code&gt;wasm-opt&lt;/code&gt;, available as an npm package (&lt;code&gt;@kent-tokyo/chematic&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Recent additions include pKa prediction, ADMET profiling, BOILED-Egg passive permeability classification, and an MCP server (14 tools) for AI agent integration. The live demo runs all of it client-side in WASM.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kent-tokyo.github.io/chematic/" rel="noopener noreferrer"&gt;https://kent-tokyo.github.io/chematic/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/chematic&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>cheminformatics</category>
      <category>opensource</category>
      <category>webassembly</category>
    </item>
    <item>
      <title>5 Features That Make chematic Stand Out as a Pure-Rust Cheminformatics Library</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Fri, 12 Jun 2026 14:44:42 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/5-features-that-make-chematic-stand-out-as-a-pure-rust-cheminformatics-library-4oh7</link>
      <guid>https://dev.to/kent-tokyo/5-features-that-make-chematic-stand-out-as-a-pure-rust-cheminformatics-library-4oh7</guid>
      <description>&lt;p&gt;I'm building chematic, a pure-Rust cheminformatics toolkit. The established toolkits are built on C++ (RDKit, Open Babel) or on the JVM (CDK). chematic targets RDKit-level coverage with zero FFI: it compiles to WASM, native binaries, and everything in between, without a single line of C or C++.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;chematic&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://kent-tokyo.github.io/chematic/" rel="noopener noreferrer"&gt;Try the live demo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Pure Rust — Reproducible and Instant Builds
&lt;/h2&gt;

&lt;p&gt;A pure-Rust implementation means chematic builds the same way on any system (Linux, macOS, Windows, CI/CD pipelines, Docker containers) with zero configuration. No cmake, no pkg-config, no "works on my machine" surprises.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;chematic&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare with the typical RDKit setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# RDKit setup&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;cmake boost eigen
&lt;span class="nv"&gt;$ &lt;/span&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;rdkit
&lt;span class="c"&gt;# a from-source build adds 5-10 minutes of C++ compilation, system-dependent&lt;/span&gt;

&lt;span class="c"&gt;# chematic setup&lt;/span&gt;
&lt;span class="nv"&gt;$ &lt;/span&gt;cargo add chematic
&lt;span class="nv"&gt;$ &lt;/span&gt;cargo build &lt;span class="nt"&gt;--release&lt;/span&gt;
   Compiling chematic v0.1.89
    Finished &lt;span class="sb"&gt;`&lt;/span&gt;release&lt;span class="sb"&gt;`&lt;/span&gt; profile ... &lt;span class="k"&gt;in &lt;/span&gt;4.23s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No cmake version mismatches, no missing nasm, no Boost library conflicts. Advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CI/CD reliability&lt;/strong&gt; — builds don't break due to system library drift&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Containerization&lt;/strong&gt; — minimal layer dependencies, smaller Docker images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-compilation&lt;/strong&gt; — compile for Linux/macOS/Windows/WASM from a single Rust toolchain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedded systems&lt;/strong&gt; — deploy to constrained environments without external build tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  WebAssembly as a First-Class Target
&lt;/h2&gt;

&lt;p&gt;chematic exports 137 JavaScript functions via the npm package &lt;code&gt;@kent-tokyo/chematic&lt;/code&gt;. The demo at &lt;a href="https://kent-tokyo.github.io/chematic/" rel="noopener noreferrer"&gt;https://kent-tokyo.github.io/chematic/&lt;/a&gt; shows 8 feature tabs (2D Viewer, Similarity, Molecular Report, Reaction, SAR Analysis, 3D Viewer, Gallery, Dynamics), all running client-side in pure WASM.&lt;/p&gt;

&lt;p&gt;Embed cheminformatics directly in web applications:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;descriptors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_descriptors_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;     &lt;span class="c1"&gt;// JSON string, e.g. {mw: 180.16, logp: 1.19, tpsa: 63.6, ...}&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;scaffold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;murcko_scaffold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fragments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;brics_fragments_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Interactive lead optimization&lt;/strong&gt; — real-time descriptor updates as users draw structures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline-first tools&lt;/strong&gt; — no backend server, no SaaS fees, no internet required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative chemistry&lt;/strong&gt; — embed in Jupyter notebooks, Observable, or design tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instant feedback&lt;/strong&gt; — sub-millisecond property calculations in the browser&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Feature Flags for Modularity
&lt;/h2&gt;

&lt;p&gt;chematic ships with 11 optional feature flags: &lt;code&gt;smiles&lt;/code&gt;, &lt;code&gt;perception&lt;/code&gt;, &lt;code&gt;mol&lt;/code&gt;, &lt;code&gt;depict&lt;/code&gt;, &lt;code&gt;fp&lt;/code&gt;, &lt;code&gt;chem&lt;/code&gt;, &lt;code&gt;smarts&lt;/code&gt;, &lt;code&gt;rxn&lt;/code&gt;, &lt;code&gt;threed&lt;/code&gt;, &lt;code&gt;inchi&lt;/code&gt;, &lt;code&gt;iupac&lt;/code&gt;. Use the &lt;code&gt;full&lt;/code&gt; omnibus flag for everything.&lt;/p&gt;

&lt;p&gt;Practical effects:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;WASM bundle size shrinks with every unused feature&lt;/li&gt;
&lt;li&gt;Compile time drops when excluding heavy modules like 3D or fingerprints&lt;/li&gt;
&lt;li&gt;Embedded targets can skip unnecessary functionality&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  3D Coordinate Generation and Force Field Minimization in Pure Rust
&lt;/h2&gt;

&lt;p&gt;3D cheminformatics requires distance geometry for initial embedding, force field parameterization, and structure optimization. chematic implements three force fields—UFF, DREIDING, and MMFF94—entirely in Rust.&lt;/p&gt;

&lt;p&gt;The pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate 3D coordinates via distance geometry&lt;/li&gt;
&lt;li&gt;Embed into conformer ensemble&lt;/li&gt;
&lt;li&gt;Minimize geometry using selected force field&lt;/li&gt;
&lt;li&gt;Prune similar conformers by RMSD&lt;/li&gt;
&lt;li&gt;Optionally run short molecular dynamics (300 K, Berendsen thermostat)&lt;/li&gt;
&lt;/ol&gt;
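
&lt;p&gt;Step 4 of the pipeline above, pruning by RMSD, can be sketched in a few lines. This is an illustrative JavaScript sketch, not chematic's implementation: the function names &lt;code&gt;rmsd&lt;/code&gt; and &lt;code&gt;pruneByRmsd&lt;/code&gt; and the coordinate layout are assumptions, and the sketch skips the superposition step that a real pipeline would normally run before comparing conformers.&lt;/p&gt;

```javascript
// Hypothetical layout: a conformer is an array of [x, y, z] points,
// with atoms listed in the same order in every conformer.
function rmsd(a, b) {
  let sum = 0;
  a.forEach((p, i) => {
    const q = b[i];
    sum += (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 + (p[2] - q[2]) ** 2;
  });
  return Math.sqrt(sum / a.length);
}

// Keep a conformer only if it is farther than `threshold`
// from every conformer that has already been kept.
function pruneByRmsd(conformers, threshold) {
  const kept = [];
  for (const conf of conformers) {
    const isNew = kept.every((k) => rmsd(k, conf) > threshold);
    if (isNew) kept.push(conf);
  }
  return kept;
}

const confA = [[0, 0, 0], [1, 0, 0]];
const confB = [[0, 0, 0], [1.01, 0, 0]]; // near-duplicate of confA
const confC = [[0, 0, 0], [0, 1, 0]];    // clearly different geometry
console.log(pruneByRmsd([confA, confB, confC], 0.1).length); // 2
```

&lt;p&gt;The greedy order matters: the first conformer of each cluster is the one that survives, so sorting by energy before pruning keeps the lowest-energy representative.&lt;/p&gt;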

&lt;p&gt;You can also compute shape descriptors (principal moments of inertia, normalized principal ratios, asphericity) and export to PDB or XYZ format. This same 3D engine powers the demo's 3D Viewer and Dynamics tabs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Drug Discovery Pipeline in ~20 Lines
&lt;/h2&gt;

&lt;p&gt;The typical workflow compresses into a few method calls:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In a single Rust binary:&lt;/span&gt;
&lt;span class="c1"&gt;// - 70+ molecular descriptors (MW, LogP, TPSA, Fsp3, exact mass, aromatic ring count, etc.)&lt;/span&gt;
&lt;span class="c1"&gt;// - Drug-likeness filtering&lt;/span&gt;
&lt;span class="c1"&gt;// - Standardization (tautomers, salts, charges)&lt;/span&gt;
&lt;span class="c1"&gt;// - Stereochemistry handling (R/S and E/Z assignment, isomer enumeration)&lt;/span&gt;
&lt;span class="c1"&gt;// - BRICS fragmentation for lead optimization&lt;/span&gt;
&lt;span class="c1"&gt;// - Fingerprints (ECFP, FCFP, MACCS, AtomPair, Torsion)&lt;/span&gt;
&lt;span class="c1"&gt;// - SMARTS and substructure search&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All linked statically. All open source. No external services.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fingerprints and Similarity-Based Screening
&lt;/h2&gt;

&lt;p&gt;Fingerprints encode molecules as bit vectors for bulk similarity comparison and identification of structurally related compounds. chematic implements six fingerprint algorithms in Rust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ECFP&lt;/strong&gt; and &lt;strong&gt;FCFP&lt;/strong&gt; — circular fingerprints (radii 2, 4, 6)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MACCS&lt;/strong&gt; — 166-bit key set (MDL standard)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AtomPair&lt;/strong&gt; and &lt;strong&gt;Torsion&lt;/strong&gt; — environment-based structural bits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;TopoPF&lt;/strong&gt; — RDKit-style topological path fingerprint&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The demo's Similarity tab shows all six with their Tanimoto scores. Use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Virtual screening&lt;/strong&gt; — find actives from a library of millions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scaffold hopping&lt;/strong&gt; — discover chemically novel analogs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compound clustering&lt;/strong&gt; — group similar leads for follow-up synthesis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hit triage&lt;/strong&gt; — prioritize candidates by structural diversity&lt;/li&gt;
&lt;/ul&gt;
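
&lt;p&gt;The Tanimoto score behind these use cases is the number of shared on-bits divided by the number of bits set in either fingerprint. A minimal JavaScript sketch, assuming fingerprints are given as lists of on-bit indices (the names &lt;code&gt;tanimoto&lt;/code&gt; and &lt;code&gt;rankBySimilarity&lt;/code&gt; and the sample data are made up for illustration and are not chematic's API):&lt;/p&gt;

```javascript
// Fingerprints are modelled as lists of "on" bit indices (an assumption of this sketch).
function tanimoto(bitsA, bitsB) {
  const a = new Set(bitsA);
  const b = new Set(bitsB);
  let common = 0;
  for (const bit of a) {
    if (b.has(bit)) common += 1;
  }
  const union = a.size + b.size - common;
  return union === 0 ? 0 : common / union;
}

// Rank a small library against a query, most similar first.
function rankBySimilarity(query, library) {
  return library
    .map((entry) => ({ name: entry.name, score: tanimoto(query, entry.bits) }))
    .sort((x, y) => y.score - x.score);
}

const query = [1, 5, 9, 42];
const library = [
  { name: "mol-2", bits: [1, 5, 100] },
  { name: "mol-3", bits: [7, 8] },
  { name: "mol-1", bits: [1, 5, 9, 42] },
];
console.log(rankBySimilarity(query, library).map((r) => r.name)); // [ 'mol-1', 'mol-2', 'mol-3' ]
```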

&lt;h2&gt;
  
  
  2D SVG Molecular Depiction
&lt;/h2&gt;

&lt;p&gt;chematic generates publication-quality SVG output with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPK atom coloring&lt;/strong&gt; (carbon gray, oxygen red, nitrogen blue, etc.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Substructure highlighting&lt;/strong&gt; for SMARTS matches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reaction scheme visualization&lt;/strong&gt; with arrow flow&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stereochemistry wedges&lt;/strong&gt; — proper depiction of wedge/dash bonds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customizable styling&lt;/strong&gt; — line width, font, colors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 2D Viewer tab renders every molecule as SVG, with live SMARTS-based highlighting. This eliminates the need for external rendering tools or server-side chemistry services.&lt;/p&gt;

&lt;h2&gt;
  
  
  3D Interactive Viewer and Molecular Dynamics
&lt;/h2&gt;

&lt;p&gt;chematic includes a full 3D viewer powered by WebGL (via Three.js). The demo's 3D Viewer tab offers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple display modes&lt;/strong&gt;: Stick, Ball &amp;amp; Stick, Spacefill (van der Waals surface)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interactive rotation and zoom&lt;/strong&gt; — drag to rotate, scroll to zoom&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time molecular dynamics&lt;/strong&gt; — short MD runs at 300 K with live coordinate updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conformer ensemble support&lt;/strong&gt; — explore multiple low-energy structures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PDB/XYZ export&lt;/strong&gt; — save optimized 3D structures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ADMET prediction&lt;/strong&gt; — shape-based descriptors from 3D coordinates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Protein-ligand docking&lt;/strong&gt; — preparation and validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lead optimization&lt;/strong&gt; — visual inspection of 3D conformations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Educational use&lt;/strong&gt; — visualizing stereochemistry and molecular motion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Where to Start
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Live demo&lt;/strong&gt;: &lt;a href="https://kent-tokyo.github.io/chematic/" rel="noopener noreferrer"&gt;https://kent-tokyo.github.io/chematic/&lt;/a&gt; — try all 8 tabs now in your browser&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get the code&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;crates.io&lt;/strong&gt;: &lt;a href="https://crates.io/crates/chematic" rel="noopener noreferrer"&gt;https://crates.io/crates/chematic&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;npm&lt;/strong&gt;: &lt;code&gt;npm install @kent-tokyo/chematic&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repository&lt;/strong&gt;: &lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/chematic&lt;/a&gt; (contributions welcome)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docs&lt;/strong&gt;: &lt;a href="https://docs.rs/chematic/0.1.89/chematic/" rel="noopener noreferrer"&gt;https://docs.rs/chematic/0.1.89/chematic/&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I released v0.1.89 in June 2026. The library has 1,521 passing tests. The goal is feature parity with RDKit, achievable purely in Rust without compromising performance or safety.&lt;/p&gt;

&lt;p&gt;If you're building cheminformatics tools in Rust, exploring serverless drug discovery pipelines, or shipping molecular analysis to the browser, try chematic. Feedback and contributions on GitHub are welcome.&lt;/p&gt;

</description>
      <category>rust</category>
      <category>cheminformatics</category>
      <category>webassembly</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Building chematic: A Pure-Rust Cheminformatics Library for WebAssembly</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Sun, 07 Jun 2026 08:52:37 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/building-a-pure-rust-cheminformatics-library-for-webassembly-373o</link>
      <guid>https://dev.to/kent-tokyo/building-a-pure-rust-cheminformatics-library-for-webassembly-373o</guid>
      <description>&lt;p&gt;Many cheminformatics tools, including &lt;a href="https://www.rdkit.org/" rel="noopener noreferrer"&gt;RDKit&lt;/a&gt; and &lt;a href="https://openbabel.org/" rel="noopener noreferrer"&gt;Open Babel&lt;/a&gt;, are written in C or C++. When they are used from JavaScript, they are often compiled through Emscripten or exposed through generated bindings. That works, but it usually brings a larger native toolchain into the build and makes package size a constant concern.&lt;/p&gt;

&lt;p&gt;What happens if you remove &lt;strong&gt;all C and C++&lt;/strong&gt; from that stack?&lt;/p&gt;

&lt;p&gt;That is the idea behind &lt;code&gt;chematic&lt;/code&gt;. I wrote a cheminformatics library from scratch in pure Rust and published it as a WASM-backed npm package. The optimized core WASM artifact is &lt;strong&gt;about 550 KB&lt;/strong&gt; in my release build. The Rust side is Cargo-based; npm packaging still uses the usual WASM binding and optimization steps.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Pure Rust?
&lt;/h2&gt;

&lt;p&gt;Cheminformatics algorithms are complicated. Ring perception in molecular graphs, stereochemistry assignment, fingerprint generation: all of these rely heavily on careful graph manipulation. Existing C++ libraries are powerful and deeply validated, but I wanted this project to keep the memory-safety boundary auditable across the whole stack.&lt;/p&gt;

&lt;p&gt;Rust was attractive for exactly that reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  Native Fit for WASM
&lt;/h3&gt;

&lt;p&gt;Rust's WASM support is integrated into the standard toolchain. The &lt;code&gt;wasm32-unknown-unknown&lt;/code&gt; target is installed with &lt;code&gt;rustup target add wasm32-unknown-unknown&lt;/code&gt;, and after that &lt;code&gt;cargo build --target wasm32-unknown-unknown&lt;/code&gt; is enough to produce a WASM artifact.&lt;/p&gt;

&lt;p&gt;For the Rust crates themselves, there is no need for OS-specific native dependencies or C/C++ build scripts. The JavaScript package still has a packaging step, but the chemistry implementation does not depend on an external native library.&lt;/p&gt;

&lt;h3&gt;
  
  
  Memory Safety as a Constraint
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;chematic&lt;/code&gt; uses &lt;strong&gt;no unsafe Rust&lt;/strong&gt;. That is not just a policy. It is a design constraint.&lt;/p&gt;

&lt;p&gt;When working with complex data structures such as molecular graphs, I wanted the implementation to stay inside Rust's normal safety model. I relied on the ownership system and borrow checker, and implemented all operations inside safe Rust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reproducibility and Determinism
&lt;/h3&gt;

&lt;p&gt;For fingerprint calculation, the same molecule must always produce the same bit vector. &lt;code&gt;chematic&lt;/code&gt; uses fixed atom-environment ordering, stable serialization, fixed-width FNV-1a hashing, and avoids randomized hashers.&lt;/p&gt;

&lt;p&gt;The goal is deterministic fingerprints across platforms and runtime environments, which makes similarity scores reproducible and easier to test.&lt;/p&gt;
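
&lt;p&gt;To make the determinism argument concrete, here is a small JavaScript sketch of 32-bit FNV-1a and of folding hashed identifiers into bit positions. It only illustrates the idea: the hash width chematic actually uses, the identifier strings, and the helper names &lt;code&gt;fnv1a32&lt;/code&gt; and &lt;code&gt;foldToBits&lt;/code&gt; are assumptions made for this example.&lt;/p&gt;

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of a string.
function fnv1a32(text) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193); // FNV prime, wrapping 32-bit multiply
  }
  return hash >>> 0; // reinterpret the result as unsigned
}

// Fold hashed identifiers into a fixed number of bit positions.
// The set plus the final sort make the output independent of input order.
function foldToBits(identifiers, nBits) {
  const bits = new Set(identifiers.map((id) => fnv1a32(id) % nBits));
  return [...bits].sort((x, y) => x - y);
}

console.log(fnv1a32("a").toString(16)); // e40c292c

// Hypothetical environment strings; the result is the same in either order.
console.log(foldToBits(["C;aromatic", "O;carbonyl"], 2048));
console.log(foldToBits(["O;carbonyl", "C;aromatic"], 2048));
```

&lt;p&gt;Nothing in this code depends on a per-process seed, which is the property a randomized hasher would break.&lt;/p&gt;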

&lt;h3&gt;
  
  
  Turning Zero FFI from a Choice into a Rule
&lt;/h3&gt;

&lt;p&gt;FFI to C or C++ libraries is useful, but once it is allowed, exceptions start appearing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"InChI needs a C library."&lt;/li&gt;
&lt;li&gt;"This optimization should be written in C."&lt;/li&gt;
&lt;li&gt;"This one dependency is too convenient to avoid."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;chematic&lt;/code&gt; forbids FFI itself. &lt;code&gt;rdkit-sys&lt;/code&gt;, &lt;code&gt;openbabel-sys&lt;/code&gt;, &lt;code&gt;cc&lt;/code&gt;, &lt;code&gt;bindgen&lt;/code&gt;: none of them are used. That constraint keeps the implementation coherent and makes the dependency boundary easier to reason about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shape of &lt;code&gt;chematic&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;The project is organized into 13 Rust crates:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chematic/
├── chematic-core       -&amp;gt; basic atom, bond, and molecule types; kekulization; zero dependencies
├── chematic-smiles     -&amp;gt; OpenSMILES parser and canonical SMILES writer
├── chematic-perception -&amp;gt; ring perception (SSSR) and aromaticity detection
├── chematic-mol        -&amp;gt; MOL/SDF file format reading and writing
├── chematic-depict     -&amp;gt; 2D SVG depiction with CPK colors and highlights
├── chematic-chem       -&amp;gt; 40+ molecular descriptors: MW, LogP, TPSA, and more
├── chematic-fp         -&amp;gt; 7 fingerprint types, including ECFP and MACCS
├── chematic-smarts     -&amp;gt; SMARTS parser and substructure search using VF2
├── chematic-3d         -&amp;gt; experimental 3D coordinate generation and force-field minimization
├── chematic-rxn        -&amp;gt; reaction SMILES/SMIRKS parser
├── chematic-wasm       -&amp;gt; WASM bindings for JavaScript and TypeScript
├── chematic-iupac      -&amp;gt; IUPAC name generation in pure Rust, without network access
└── chematic            -&amp;gt; umbrella crate integrating the full set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At the moment, all &lt;strong&gt;933 tests&lt;/strong&gt; pass. I also ran a ChEMBL 37 validation pass over 2,897,819 molecule records. In that pass, the checked SMILES inputs parsed successfully and satisfied the validation checks used by the test harness.&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing the Algorithms
&lt;/h2&gt;

&lt;p&gt;Here are some of the core chemistry-specific algorithms and how they were implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ring Perception (SSSR)
&lt;/h3&gt;

&lt;p&gt;Aromaticity, ring counts, and many other molecular properties depend on ring structures such as the single ring of benzene or the two fused rings of naphthalene. Detecting those rings from a molecular graph is the job of SSSR: the Smallest Set of Smallest Rings.&lt;/p&gt;

&lt;p&gt;A general graph library such as &lt;code&gt;petgraph&lt;/code&gt; can compute cycle bases, but chemistry has stricter definitions of what counts as a ring. Those definitions do not always match the generic graph-theory answer.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;chematic&lt;/code&gt; implements an algorithm specialized for chemical requirements. SSSR is not unique for every graph, so the target is not "the one true ring set." The practical goal is stable behavior that matches the selected chemistry test cases and RDKit-style expectations used by the project.&lt;/p&gt;
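
&lt;p&gt;Although the choice of rings is not always unique, the number of rings in an SSSR is fixed by the graph: bonds minus atoms plus connected components. The JavaScript sketch below computes that count with a small union-find. It does not pick the rings themselves and is not chematic's algorithm; &lt;code&gt;countComponents&lt;/code&gt; and &lt;code&gt;sssrSize&lt;/code&gt; are names invented for this example.&lt;/p&gt;

```javascript
// atomCount: number of atoms; bonds: array of [i, j] atom-index pairs.
function countComponents(atomCount, bonds) {
  const parent = Array.from({ length: atomCount }, (_, i) => i);
  const find = (i) => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving
      i = parent[i];
    }
    return i;
  };
  let components = atomCount;
  for (const [a, b] of bonds) {
    const ra = find(a);
    const rb = find(b);
    if (ra !== rb) {
      parent[ra] = rb; // merge the two fragments
      components -= 1;
    }
  }
  return components;
}

// Size of any SSSR: bonds - atoms + connected components.
function sssrSize(atomCount, bonds) {
  return bonds.length - atomCount + countComponents(atomCount, bonds);
}

const benzene = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]];
const naphthalene = [
  [0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0], // first ring
  [4, 6], [6, 7], [7, 8], [8, 9], [9, 5],         // second ring, fused at bond 4-5
];
console.log(sssrSize(6, benzene));      // 1
console.log(sssrSize(10, naphthalene)); // 2
```

&lt;p&gt;A count like this is a cheap sanity check for any ring-perception routine: whatever rings it returns, there should be exactly this many.&lt;/p&gt;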

&lt;h3&gt;
  
  
  Kekulization
&lt;/h3&gt;

&lt;p&gt;Aromatic SMILES such as &lt;code&gt;c1ccccc1&lt;/code&gt; leaves the bond orders inside the ring implicit, so alternating single and double bonds must be assigned before the molecule can be written in Kekule form. This process is called kekulization. For many common aromatic systems, it can be formulated as a matching problem over the molecular graph, with explicit failure handling for cases that cannot be assigned consistently.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Input:  SMILES "c1ccccc1" (benzene, aromatic notation)&lt;/span&gt;
&lt;span class="c1"&gt;// Output: "C1=CC=CC=C1"   (Kekule form with alternating single and double bonds)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
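
&lt;p&gt;The matching formulation can be sketched with plain backtracking. This JavaScript example is a simplification made for illustration, not chematic's code: it assumes every aromatic atom needs exactly one double bond (true for benzene-like carbons, not for atoms such as a pyrrole-type nitrogen), and the function name &lt;code&gt;kekulize&lt;/code&gt; is hypothetical. Production implementations typically use an augmenting-path matching algorithm instead of exhaustive search.&lt;/p&gt;

```javascript
// atomCount: number of aromatic atoms; aromaticBonds: [i, j] pairs between them.
// Returns the bonds to mark as double, or null if no consistent assignment exists.
function kekulize(atomCount, aromaticBonds) {
  const matched = new Array(atomCount).fill(false);
  const doubles = [];

  function solve() {
    const atom = matched.indexOf(false); // first atom still lacking a double bond
    if (atom === -1) return true;        // every atom has exactly one double bond
    for (const [a, b] of aromaticBonds) {
      if (![a, b].includes(atom)) continue;
      const other = a === atom ? b : a;
      if (matched[other]) continue;
      matched[atom] = true;
      matched[other] = true;
      doubles.push([a, b]);
      if (solve()) return true;
      doubles.pop();                     // backtrack and try another bond
      matched[atom] = false;
      matched[other] = false;
    }
    return false;                        // explicit failure for this branch
  }

  return solve() ? doubles : null;
}

const benzeneRing = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]];
console.log(kekulize(6, benzeneRing)); // [ [ 0, 1 ], [ 2, 3 ], [ 4, 5 ] ]

// An odd ring cannot be perfectly matched under this simplification.
const fiveRing = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 0]];
console.log(kekulize(5, fiveRing)); // null
```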



&lt;p&gt;Rust's type system helps prevent invalid kekulization states from leaking through the API. It does not make chemistry errors impossible, but it does make illegal intermediate states harder to represent accidentally.&lt;/p&gt;

&lt;h3&gt;
  
  
  CIP Stereochemistry
&lt;/h3&gt;

&lt;p&gt;CIP rules, the Cahn-Ingold-Prelog rules, determine R/S configuration for chiral centers and E/Z configuration for double bonds. The priority assignment algorithm is complex: each atom receives a priority, that priority can depend on its neighbors, and the effect propagates recursively through the graph.&lt;/p&gt;

&lt;p&gt;The current implementation covers the subset needed by the library's stereochemistry tests, including common tetrahedral and double-bond cases. More obscure CIP edge cases, such as advanced pseudoasymmetry handling, should be treated as an area for continued validation rather than a solved claim.&lt;/p&gt;

&lt;h3&gt;
  
  
  ECFP Fingerprints (Morgan Algorithm)
&lt;/h3&gt;

&lt;p&gt;ECFP fingerprints convert the local environment around atoms into hashed identifiers, which are then used to calculate molecular similarity. The algorithm runs for multiple rounds, expanding each atom's neighborhood information and hashing it at every step.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;chematic&lt;/code&gt; canonicalizes input order before calculation and uses a fixed-width FNV-1a hash. The important property is not the hash function alone, but the combination of stable ordering, stable serialization, and no randomized hashing. As a result, the same molecule should produce the same bit vector every time.&lt;/p&gt;
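
&lt;p&gt;A stripped-down JavaScript sketch of the Morgan-style refinement loop may help here. It is deliberately simpler than real ECFP and is not chematic's implementation: bond orders, richer atom invariants, and duplicate-substructure removal are left out, and the names &lt;code&gt;fnv1a32&lt;/code&gt; and &lt;code&gt;morganIdentifiers&lt;/code&gt; are assumptions for this example. What it does show is the point made above: sorting neighbor identifiers before hashing removes any dependence on bond order.&lt;/p&gt;

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of a string.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (const byte of new TextEncoder().encode(text)) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0;
}

// atoms: initial invariant strings (here just element symbols);
// bonds: [i, j] index pairs. Returns the sorted set of identifiers from all rounds.
function morganIdentifiers(atoms, bonds, radius) {
  const neighbors = atoms.map(() => []);
  for (const [a, b] of bonds) {
    neighbors[a].push(b);
    neighbors[b].push(a);
  }
  let ids = atoms.map((label) => fnv1a32(label));
  const seen = new Set(ids);
  for (let round = 0; round !== radius; round += 1) {
    ids = ids.map((id, i) => {
      // Sort neighbor identifiers so the hash input has a stable order.
      const around = neighbors[i].map((n) => ids[n]).sort((x, y) => x - y);
      return fnv1a32([id, ...around].join(","));
    });
    ids.forEach((id) => seen.add(id));
  }
  return [...seen].sort((x, y) => x - y);
}

// Ethanol heavy atoms (C, C, O), with the bonds listed in two different ways.
const forward = morganIdentifiers(["C", "C", "O"], [[0, 1], [1, 2]], 2);
const shuffled = morganIdentifiers(["C", "C", "O"], [[2, 1], [1, 0]], 2);
console.log(JSON.stringify(forward) === JSON.stringify(shuffled)); // true
```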

&lt;h2&gt;
  
  
  Scope and Packaging Tradeoffs
&lt;/h2&gt;

&lt;p&gt;This is not a replacement for RDKit. RDKit has decades of chemistry validation and a much broader feature set. The tradeoff in &lt;code&gt;chematic&lt;/code&gt; is narrower scope in exchange for a pure-Rust implementation and a smaller WASM-oriented package.&lt;/p&gt;

&lt;p&gt;The numbers below are intended as project-level packaging context, not a benchmark. WASM and npm sizes vary depending on build options, exported APIs, JavaScript glue, compression, and optimization settings.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Item&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;chematic&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;RDKit.js / Open Babel style stacks&lt;/th&gt;
&lt;th&gt;OCL.js / Indigo-style stacks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Native dependency in the chemistry core&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Usually yes&lt;/td&gt;
&lt;td&gt;Depends on project&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;WASM/package size profile&lt;/td&gt;
&lt;td&gt;Small optimized core artifact in my build&lt;/td&gt;
&lt;td&gt;Often larger because mature native libraries expose broad functionality&lt;/td&gt;
&lt;td&gt;Usually depends heavily on exported feature set&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build model&lt;/td&gt;
&lt;td&gt;Cargo-based Rust crates plus WASM packaging&lt;/td&gt;
&lt;td&gt;Native build toolchain plus generated bindings&lt;/td&gt;
&lt;td&gt;Project-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature coverage&lt;/td&gt;
&lt;td&gt;Focused subset implemented in Rust&lt;/td&gt;
&lt;td&gt;Much broader, especially RDKit&lt;/td&gt;
&lt;td&gt;Broader in some areas, different scope in others&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;InChI/InChIKey&lt;/td&gt;
&lt;td&gt;No, out of scope&lt;/td&gt;
&lt;td&gt;Often available&lt;/td&gt;
&lt;td&gt;Often available&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The standard/reference &lt;a href="https://www.inchi-trust.org/" rel="noopener noreferrer"&gt;InChI&lt;/a&gt; implementation is written in C. I chose not to wrap it because that would violate the no-FFI constraint. &lt;code&gt;chematic&lt;/code&gt; documents InChI/InChIKey support as out of scope.&lt;/p&gt;

&lt;p&gt;Instead, it covers many use cases through canonical SMILES, Murcko scaffolds, and molecular graph isomorphism.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Cases
&lt;/h3&gt;

&lt;p&gt;This library is useful in places where a lightweight browser-side cheminformatics implementation is more important than full RDKit parity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser-side chemical database filtering&lt;/strong&gt;: fingerprint similarity and descriptor filters that can run directly on an end user's machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prototype screening UIs&lt;/strong&gt;: Lipinski rules, QED, SA score, and other descriptors calculated from a web interface without a backend round trip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Teaching and visualization workflows&lt;/strong&gt;: molecule parsing, depiction, and experimental 3D generation for interactive demos.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lightweight SAR exploration&lt;/strong&gt;: extracting simple chemical-change patterns in desktop or browser applications.&lt;/li&gt;
&lt;/ul&gt;
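
&lt;p&gt;As an example of the screening use case, a Lipinski rule-of-five check is only a handful of comparisons once the descriptors are available. The JavaScript sketch below works on a plain object; the field names (&lt;code&gt;mw&lt;/code&gt;, &lt;code&gt;logp&lt;/code&gt;, &lt;code&gt;hbd&lt;/code&gt;, &lt;code&gt;hba&lt;/code&gt;) and the sample values are assumptions for illustration and may not match the keys that chematic's descriptor JSON actually uses.&lt;/p&gt;

```javascript
// Field names are hypothetical, not necessarily chematic's JSON keys.
function lipinskiViolations(d) {
  const checks = [
    500 >= d.mw,  // molecular weight at most 500 Da
    5 >= d.logp,  // logP at most 5
    5 >= d.hbd,   // at most 5 hydrogen-bond donors
    10 >= d.hba,  // at most 10 hydrogen-bond acceptors
  ];
  return checks.filter((ok) => !ok).length;
}

// A common convention tolerates a single violation.
function passesLipinski(d) {
  return 1 >= lipinskiViolations(d);
}

// Illustrative values only.
const smallMolecule = { mw: 180.16, logp: 1.19, hbd: 1, hba: 4 };
const largeLipophilic = { mw: 812.0, logp: 6.3, hbd: 2, hba: 9 };
console.log(passesLipinski(smallMolecule));   // true
console.log(passesLipinski(largeLipophilic)); // false
```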

&lt;h2&gt;
  
  
  Development Progress
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;chematic&lt;/code&gt; was developed in phases, from Phase 1 for the foundation to Phase 15 for extended functionality. It is now at v0.1.32 and still actively maintained.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Phase 1-6: core functionality: parsing, ring perception, descriptors, fingerprints, and RDKit compatibility&lt;/li&gt;
&lt;li&gt;Phase 7-9: extended descriptors and diversity selection: EState, VSA, SA score, MaxMin, and Butina&lt;/li&gt;
&lt;li&gt;Phase 10-15: mutable APIs, 2D stereochemistry, reaction formats, and IUPAC naming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At each stage, I repeated compatibility testing against &lt;a href="https://www.ebi.ac.uk/chembl/" rel="noopener noreferrer"&gt;ChEMBL&lt;/a&gt;. The ChEMBL 37 validation pass covered 2,897,819 molecule records and checked that the input structures used by the harness could be parsed and processed without failing those validation checks. This should be read as parser and pipeline validation, not as proof that every descriptor or stereochemistry result is scientifically equivalent to RDKit.&lt;/p&gt;

&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;The main limitation is scope. &lt;code&gt;chematic&lt;/code&gt; is designed around a no-FFI pure-Rust constraint, so it intentionally does not expose everything available in mature cheminformatics systems.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;InChI and InChIKey are out of scope because wrapping the reference implementation would require FFI.&lt;/li&gt;
&lt;li&gt;3D coordinate generation and force-field minimization are useful for demos and exploratory workflows, but they should not be treated as production-grade molecular modeling validation.&lt;/li&gt;
&lt;li&gt;CIP stereochemistry support covers the common cases tested by the project, but rare edge cases need more validation.&lt;/li&gt;
&lt;li&gt;RDKit compatibility is a testing target for selected behavior, not a claim of full RDKit equivalence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Current Status and Demo
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Live Demo
&lt;/h3&gt;

&lt;p&gt;Descriptor calculation, fingerprint similarity, drug-likeness rules, a 3D viewer, reaction schemes, and more run in the browser through WASM:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kent-tokyo.github.io/chematic/" rel="noopener noreferrer"&gt;https://kent-tokyo.github.io/chematic/&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  JavaScript and TypeScript Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; @kent-tokyo/chematic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;init&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;parse_smiles&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;get_descriptors_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;brics_fragments_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;enumerate_stereo_isomers_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tanimoto_ecfp4&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@kent-tokyo/chematic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// aspirin&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;molecular_weight&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt; &lt;span class="c1"&gt;// ~180.16&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;qed&lt;/span&gt;&lt;span class="p"&gt;());&lt;/span&gt;              &lt;span class="c1"&gt;// drug-likeness [0,1]&lt;/span&gt;

&lt;span class="c1"&gt;// Get multiple descriptors at once&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get_descriptors_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`MW: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mw&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, TPSA: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tpsa&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, LogP: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;logP&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// BRICS fragmentation for decomposing molecules&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;frags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;brics_fragments_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Fragment count: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;frags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Stereoisomer enumeration&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;isomers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;enumerate_stereo_isomers_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;parse_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;C(F)(Cl)Br&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)));&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Possible stereoisomers: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;isomers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Fingerprint similarity for screening&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;caffeine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Cn1cnc2c1c(=O)n(c(=O)n2C)C&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;aspirin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CC(=O)Oc1ccccc1C(=O)O&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Similarity: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;tanimoto_ecfp4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;caffeine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;aspirin&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;More than 100 functions are exposed through the WASM API, and TypeScript definitions are generated automatically, so IDE completion works out of the box.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rust Usage
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;chematic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="py"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1.32"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="py"&gt;features&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"smiles"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"fp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"chem"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;smiles&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;chematic&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;ecfp4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;aspirin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"CC(=O)Oc1ccccc1C(=O)O"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;caffeine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;parse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Cn1cnc2c1c(=O)n(c(=O)n2C)C"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.unwrap&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;ecfp4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;aspirin&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.tanimoto&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nf"&gt;ecfp4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;caffeine&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Similarity: {:.3}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// ~0.4&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
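
&lt;p&gt;For readers new to the metric, the Tanimoto coefficient is the number of bits two fingerprints share divided by the number of bits set in either one. The TypeScript sketch below is a minimal illustration of that definition only; the &lt;code&gt;tanimoto&lt;/code&gt; helper and its list-of-bit-indices input are hypothetical and are not &lt;code&gt;chematic&lt;/code&gt;'s API or internal representation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Minimal Tanimoto sketch over lists of "on" bit indices.
// Hypothetical helper for illustration, not chematic's implementation.
function tanimoto(a: number[], b: number[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach(function (bit) {
    if (setB.has(bit)) shared += 1;
  });
  // Size of the union = size of A + size of B - size of the intersection.
  const union = setA.size + setB.size - shared;
  // Two empty fingerprints are scored 0 here, which is an arbitrary choice.
  return union === 0 ? 0 : shared / union;
}

// Shared bits: 4 and 9. Union: 1, 4, 9, 16, 25. Result: 2 / 5.
console.log(tanimoto([1, 4, 9, 16], [4, 9, 25]).toFixed(3)); // prints "0.400"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A score of 1 means the two bit sets are identical and 0 means they share nothing, which is why the value works as a quick screening filter.&lt;/p&gt;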



&lt;h3&gt;
  
  
  Testing
&lt;/h3&gt;

&lt;p&gt;All 933 unit tests pass. The ChEMBL 37 validation pass over 2,897,819 molecule records also completes successfully under the parser and pipeline checks described above.&lt;/p&gt;

&lt;p&gt;That does not just mean "the code compiles." It means the implementation has been tested against a large real-world chemical database. The reported WASM binary size is measured for the optimized core artifact after &lt;code&gt;wasm-opt&lt;/code&gt;; compressed transfer size is smaller, while npm package size depends on generated JavaScript and TypeScript files.&lt;/p&gt;




&lt;p&gt;&lt;code&gt;chematic&lt;/code&gt; implements a focused set of chemical information processing features, from molecular representation to similarity calculation and experimental 3D structure generation, using pure Rust and Rust's native WASM support.&lt;/p&gt;

&lt;p&gt;It is still under active development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kent-tokyo/chematic" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/chematic&lt;/a&gt;&lt;/p&gt;

</description>
      <category>rust</category>
      <category>webassembly</category>
      <category>opensource</category>
      <category>chemistry</category>
    </item>
    <item>
      <title>What I Do Before Letting Claude Code Touch Web App Design</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Wed, 03 Jun 2026 13:23:08 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/what-i-do-before-letting-claude-code-touch-web-app-design-53n0</link>
      <guid>https://dev.to/kent-tokyo/what-i-do-before-letting-claude-code-touch-web-app-design-53n0</guid>
      <description>&lt;p&gt;Claude Code writes code well, but "make it look nice" as a standalone instruction produces inconsistent results. Anthropic's &lt;a href="https://code.claude.com/docs/en/best-practices" rel="noopener noreferrer"&gt;best practices guide&lt;/a&gt; recommends separating exploration and planning from implementation, and using screenshots as verification signals rather than eyeballing the code. I extended that principle to the entire design process. Here is the sequence I use.&lt;/p&gt;




&lt;h2&gt;
  
  
  Before writing any code
&lt;/h2&gt;

&lt;p&gt;Starting with code means design changes become code changes. If you decide later that you want a different layout, you end up dismantling components that are already written.&lt;/p&gt;

&lt;p&gt;I do two things before touching code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get the spec in Markdown&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write a list of all screens in this app and the UI elements each screen needs.
No code yet.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code writes out the screen inventory, per-screen elements, and data flow in prose. Deciding the layout direction here — sidebar or not, fixed header or not — reduces "actually, let's change this" iterations later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get the component tree as text&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Based on the screen layout, write the parent-child relationships between components as a tree.
Include the props each component receives and the state it holds.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep the output as text, not code. Reviewing the structure before implementation makes it easier to trace later why things ended up the way they did.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build the design system first
&lt;/h2&gt;

&lt;p&gt;Before creating a single component, lock in the rules for color, typography, and spacing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create the base of a design system with these constraints:
- Framework: Tailwind CSS
- Color palette: primary in blue, include grayscale
- Font: system font
- Define as CSS custom properties

No component code yet.
Create only tailwind.config.ts and globals.css.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skipping this leads to hardcoded color values scattered across components. Unifying &lt;code&gt;text-blue-500&lt;/code&gt; to &lt;code&gt;text-primary&lt;/code&gt; everywhere afterward is tedious.&lt;/p&gt;
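
&lt;p&gt;As a rough idea of what that token layer can look like, here is a minimal &lt;code&gt;tailwind.config.ts&lt;/code&gt; sketch. It assumes Tailwind CSS v3, a &lt;code&gt;src&lt;/code&gt; directory for the content glob, and two custom properties named &lt;code&gt;--color-primary&lt;/code&gt; and &lt;code&gt;--color-surface&lt;/code&gt; defined in &lt;code&gt;globals.css&lt;/code&gt;; all three are assumptions for illustration, not output from Claude Code.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// tailwind.config.ts (sketch for Tailwind CSS v3; the content glob is an assumption)
import type { Config } from "tailwindcss";

const config: Config = {
  content: ["./src/**/*.{ts,tsx}"],
  theme: {
    extend: {
      colors: {
        // Resolved from the custom properties assumed to live in globals.css,
        // so components can write text-primary or bg-surface.
        primary: "var(--color-primary)",
        surface: "var(--color-surface)",
      },
    },
  },
};

export default config;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With the names defined once, changing the brand color later is a one-line edit to the custom property instead of a search across components.&lt;/p&gt;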

&lt;p&gt;Without Tailwind, CSS custom properties work as a substitute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;&lt;span class="nd"&gt;:root&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="py"&gt;--color-primary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#2563eb&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="py"&gt;--color-surface&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;#f8fafc&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="py"&gt;--spacing-base&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.25rem&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the design direction is settled, I put it in &lt;code&gt;DESIGN.md&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;DESIGN.md&lt;/code&gt; follows a format &lt;a href="https://github.com/google-labs-code/design.md" rel="noopener noreferrer"&gt;open-sourced by Google Stitch&lt;/a&gt;, readable by Claude Code, Cursor, Copilot, and other agents. The structure has two layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;colors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;primary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#2563eb"&lt;/span&gt;
  &lt;span class="na"&gt;surface&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;#f8fafc"&lt;/span&gt;
&lt;span class="na"&gt;typography&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;base&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;16px"&lt;/span&gt;
  &lt;span class="na"&gt;scale&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.25&lt;/span&gt;
&lt;span class="na"&gt;spacing&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;unit&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;4px"&lt;/span&gt;
&lt;span class="na"&gt;components&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;library&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;shadcn/ui"&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="c1"&gt;## Design rationale&lt;/span&gt;

&lt;span class="s"&gt;Primary color is blue for trust. Colors are chosen to meet WCAG AA contrast ratio (4.5:1 minimum for text on background).&lt;/span&gt;
&lt;span class="s"&gt;Responsive is mobile-first, using only the `md` breakpoint.&lt;/span&gt;
&lt;span class="na"&gt;Reference design&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stripe dashboard (high information density, minimal whitespace).&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The YAML frontmatter holds machine-readable tokens (color, font, spacing). The Markdown body holds the rationale — the "why" behind each value. Agents fill in edge cases from the rationale when tokens alone are not enough.&lt;/p&gt;

&lt;p&gt;Add &lt;code&gt;@DESIGN.md&lt;/code&gt; to &lt;code&gt;CLAUDE.md&lt;/code&gt; to have Claude Code reference the design direction across sessions. No need to repeat it in every prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prompts for creating components
&lt;/h2&gt;

&lt;p&gt;Design instructions land better when they say "what to reference" rather than "how it should look."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Less effective&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Make it modern and clean.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Claude Code's "modern" and your "modern" are different things.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More effective&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Build this in a style close to shadcn/ui's Card component.
Use rounded-lg for border radius, shadow-sm for shadow, p-6 as the base padding.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Design like the Stripe dashboard — high information density,
minimal whitespace, small font sizes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Naming a reference directly reduces variation in the output. Using shadcn/ui or Radix UI component names directly works for the same reason.&lt;/p&gt;

&lt;h3&gt;
  
  
  Specify interaction states from the start
&lt;/h3&gt;

&lt;p&gt;Specifying only the default state leaves hover, disabled, and error states undefined. Listing all states up front avoids "the disabled state doesn't look right" fixes later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Create a Button component. Implement all of these states:
- default: primary color background, white text
- hover: slightly darker (darken 10%)
- disabled: grayed out, cursor-not-allowed
- loading: replace text with a spinner, non-clickable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For form input fields, specify the error state (red border, error message visible) at the same time.&lt;/p&gt;
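
&lt;p&gt;One way to keep that state list from drifting once the component exists is to encode it in a type, so a missing state becomes a compile error. The sketch below is hypothetical: the state names mirror the Button prompt, while &lt;code&gt;buttonClassName&lt;/code&gt;, the &lt;code&gt;bg-primary&lt;/code&gt; token, and the specific Tailwind classes are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical sketch: hover is handled by Tailwind's hover: variant,
// so only the states driven by props appear in the union.
type ButtonState = "default" | "disabled" | "loading";

// The mapped type requires an entry for every member of ButtonState.
const stateClasses: { [S in ButtonState]: string } = {
  default: "bg-primary text-white hover:brightness-90",
  disabled: "bg-gray-300 text-gray-500 cursor-not-allowed",
  loading: "bg-primary text-white pointer-events-none",
};

function buttonClassName(state: ButtonState): string {
  return "rounded-lg px-4 py-2 " + stateClasses[state];
}

console.log(buttonClassName("disabled"));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Adding an error state to the union later forces a matching entry in the class map before the code compiles.&lt;/p&gt;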

&lt;h3&gt;
  
  
  Dark mode
&lt;/h3&gt;

&lt;p&gt;Decide up front whether to build dark mode in or defer it. Asking for dark mode after the fact means rewriting every hardcoded light-mode value.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Add dark mode support using Tailwind's dark: classes.
The CSS variables in DESIGN.md already have both light and dark definitions. Reference those.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Feedback loop
&lt;/h2&gt;

&lt;p&gt;Once a component is working, I give feedback based on screenshots rather than reading through the code. The official best practices guide covers this pattern too: compare a screenshot against the target design and iterate from there.&lt;/p&gt;

&lt;p&gt;I use &lt;code&gt;/run&lt;/code&gt; to start the server, then use a screenshot tool (the &lt;code&gt;frontend-design&lt;/code&gt; skill or Playwright MCP) to capture the screen before giving feedback.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take a screenshot of the current home screen.
The header is too tall. Set it to h-16.
Bring the navigation font size down to text-sm.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Packing too many changes into one instruction makes it hard to check what changed. I aim for two or three visually verifiable changes per round.&lt;/p&gt;

&lt;p&gt;Responsive checks follow the same flow.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take a screenshot at mobile width (375px).
The cards are probably still displaying side by side.
Fix them to stack vertically on mobile.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  When regressions happen
&lt;/h3&gt;

&lt;p&gt;After enough iterations, "fixing one thing broke another" will come up. Asking Claude Code to compare the previous screenshot to the current state helps narrow down the cause.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Compare the previous screenshot to the current screen.
List any changes that happened unintentionally. No fixes yet.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Understanding the cause before applying a fix avoids repeating the same patch cycle.&lt;/p&gt;




&lt;h2&gt;
  
  
  Things that kept going wrong for me
&lt;/h2&gt;

&lt;p&gt;The drift problem bit me the most: "change the header font" would come back with the entire layout reorganized. Now I add an explicit boundary to any scoped change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Change only the font size in the Header component.
Do not touch the layout or colors.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I also tried generating all components in one pass a few times. It produced a lot of code quickly, but I'd invariably find structural issues only after everything was already built on top of that foundation. Building one component, confirming it, then moving on turned out to be faster overall even though it felt slower.&lt;/p&gt;

&lt;p&gt;Vague feedback is the other thing that stalls progress. "The overall balance is wrong" is not a prompt Claude Code can act on. Taking a screenshot and naming specifics — "the left column is too wide relative to the right," "the font sizes look inconsistent between cards" — gets an actionable result.&lt;/p&gt;




&lt;p&gt;The full sequence looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Get the spec in Markdown
2. Get the component structure as text
3. Build the design system (tailwind.config / globals.css)
4. Implement components one at a time
5. Review screenshots and iterate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Skipping steps 1 to 3 and jumping into implementation consistently led to more rework later. Getting the structure right before writing code produced fewer revisions, at least in my experience.&lt;/p&gt;

</description>
      <category>claudecode</category>
      <category>webdev</category>
      <category>design</category>
      <category>ai</category>
    </item>
    <item>
      <title>Japanese SDS vs EU and US</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Sat, 30 May 2026 07:50:45 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/japanese-sds-vs-eu-and-us-46l2</link>
      <guid>https://dev.to/kent-tokyo/japanese-sds-vs-eu-and-us-46l2</guid>
      <description>&lt;p&gt;The EU operates under CLP Regulation — its own implementation of GHS — while the US is mid-transition to OSHA HazCom 2024, based on GHS Revision 7. Japanese SDSs aligned to JIS Z 7253:2019 diverge from both in ways that go beyond wording: classification criteria differ, mandatory fields are missing, and some sections require a full rewrite rather than any form of translation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparison of Requirements Across Three Regions
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;GHS Base&lt;/th&gt;
&lt;th&gt;Key Characteristics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japan&lt;/td&gt;
&lt;td&gt;JIS Z 7253:2019&lt;/td&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;Revision 6 (Revision 9-aligned JIS Z 7253:2025 edition to be issued December 2025)&lt;/td&gt;
&lt;td&gt;National standard with some hazard categories not adopted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EU&lt;/td&gt;
&lt;td&gt;CLP Regulation (EC) No 1272/2008&lt;/td&gt;
&lt;td&gt;Official language of the destination member state&lt;/td&gt;
&lt;td&gt;Independent implementation referencing GHS&lt;/td&gt;
&lt;td&gt;REACH registration number and UFI required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;OSHA HazCom 2012→2024&lt;/td&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;Revision 3→Revision 7 (transition in progress)&lt;/td&gt;
&lt;td&gt;OSHA PEL, ACGIH TLV, and NIOSH REL listed together&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The UK operates its own GB CLP after Brexit (retained EU CLP law), diverging from EU CLP over time. Canada introduced WHMIS 2015 (GHS Revision 5 base) and updated it to a GHS Revision 7/8 base in December 2022. This article focuses on the EU and US.&lt;/p&gt;

&lt;h2&gt;
  
  
  EU CLP and Its Relationship to GHS
&lt;/h2&gt;

&lt;p&gt;CLP Regulation does not adopt a specific GHS revision. It operates as an independent regulation that references GHS — the GHS revision number does not map directly to CLP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harmonised Classification (Annex VI):&lt;/strong&gt; Annex VI of CLP lists harmonised classifications agreed at EU level for specific substances, adopted by the European Commission on the basis of opinions from ECHA's Risk Assessment Committee. Where a harmonised classification exists, it is mandatory: a supplier's self-classification cannot override it. These harmonised classifications do not always align with JIS classifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New Hazard Classes:&lt;/strong&gt; Commission Delegated Regulation (EU) 2023/707 (in force April 2023) and Regulation (EU) 2024/2865 (in force December 2024) introduced hazard classes that do not exist in GHS:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Endocrine disruption (ED Category 1 and 2): human health&lt;/li&gt;
&lt;li&gt;Endocrine disruption (ED Category 1 and 2): environment&lt;/li&gt;
&lt;li&gt;Persistent, Bioaccumulative, and Toxic (PBT / vPvB)&lt;/li&gt;
&lt;li&gt;Persistent, Mobile, and Toxic (PMT / vPvM)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these classes exist in JIS Z 7253 or HazCom.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transition Schedule (as of May 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Substances&lt;/strong&gt; first placed on the market from 1 May 2025 onward: new hazard classes apply&lt;/li&gt;
&lt;li&gt;Existing substances (in the supply chain): grace period until 1 November 2026&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixtures&lt;/strong&gt; first placed on the market from 1 May 2026 onward: new hazard classes apply&lt;/li&gt;
&lt;li&gt;Existing mixtures (in the supply chain): grace period until 1 May 2028&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Mixtures first entering the EU market from May 2026 already require compliance with the new hazard classes.&lt;/p&gt;
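
&lt;p&gt;For anyone tracking these dates in a compliance tool, the schedule reduces to a small lookup. The TypeScript sketch below copies the four dates from the list above; the type and function names are hypothetical, and treating "first placed before the cut-over date" as "existing" is a simplifying assumption rather than legal guidance.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Dates are copied from the transition schedule above.
// Type and function names are hypothetical and exist only for this sketch.
type ProductKind = "substance" | "mixture";

const schedule = {
  substance: { newFrom: "2025-05-01", existingUntil: "2026-11-01" },
  mixture: { newFrom: "2026-05-01", existingUntil: "2028-05-01" },
};

// Returns the ISO date from which the new hazard classes apply,
// given when the product was first placed on the EU market.
function newClassesApplyFrom(kind: ProductKind, firstPlaced: string): string {
  const entry = schedule[kind];
  // A negative offset means the product was placed before the cut-over date.
  const offsetMs = Date.parse(firstPlaced) - Date.parse(entry.newFrom);
  const isExisting = Math.sign(offsetMs) === -1;
  return isExisting ? entry.existingUntil : entry.newFrom;
}

console.log(newClassesApplyFrom("mixture", "2024-03-15")); // prints "2028-05-01"
console.log(newClassesApplyFrom("mixture", "2026-06-01")); // prints "2026-05-01"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the dates in one table makes it easy to update the tool if the schedule is amended.&lt;/p&gt;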

&lt;h2&gt;
  
  
  US HazCom 2024 Transition
&lt;/h2&gt;

&lt;p&gt;On 20 May 2024, OSHA published the final rule for HazCom 2024, updating the Hazard Communication Standard to align with GHS Revision 7. In January 2026, OSHA extended the compliance deadlines to allow time for regulatory guidance to be prepared.&lt;/p&gt;

&lt;p&gt;The main changes include a new category for desensitized explosives, revised H-statement and P-statement text, and additional SDS requirements.&lt;/p&gt;

&lt;p&gt;Compliance deadlines are phased and differ by role — manufacturers, importers, distributors, and downstream users have separate timelines. Check OSHA's official source for the current deadlines.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; HazCom 2024 deadlines were revised by OSHA's January 2026 extension notice. Current effective dates are listed at &lt;a href="https://www.osha.gov/hazcom/effective-dates" rel="noopener noreferrer"&gt;OSHA HazCom effective dates&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Classification Differences
&lt;/h2&gt;

&lt;p&gt;Consider Acute Toxicity Category 5 (oral LD50: 2,000–5,000 mg/kg):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Region/Standard&lt;/th&gt;
&lt;th&gt;Acute Toxicity (oral) Category 5&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japan (JIS Z 7253)&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Treated as not classified&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EU (CLP)&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Independent decision from GHS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US (HazCom 2012/2024)&lt;/td&gt;
&lt;td&gt;Optional (not mandatory)&lt;/td&gt;
&lt;td&gt;Manufacturer's discretion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Japan, the EU, and the US are aligned in not mandating Category 5, which differs significantly from China (GB 30000.1-2024).&lt;/p&gt;

&lt;p&gt;Categories and classes where adoption diverges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hazard Class&lt;/th&gt;
&lt;th&gt;Japan&lt;/th&gt;
&lt;th&gt;EU&lt;/th&gt;
&lt;th&gt;US&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Acute Toxicity Category 5&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Skin Irritation Category 3&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aspiration Hazard Category 2&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Not adopted&lt;/td&gt;
&lt;td&gt;Optional&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Endocrine Disruption (ED)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Categories 1 &amp;amp; 2 (from 2023)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PBT / PMT&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes (from 2023)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Like Japan, the EU does not adopt Skin Irritation Category 3 or Aspiration Hazard Category 2. The divergence lies in the EU-only hazard classes (endocrine disruption, PBT/PMT) and in mandatory harmonised classification, so substances listed as not classified in a Japanese SDS may still require classification under EU CLP.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Translation Fails
&lt;/h2&gt;

&lt;p&gt;Of the 16 SDS sections, about half (Sections 4–7 and 9–12) can largely be carried over with translation; most of the rest require re-evaluation against local standards.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Translatable?&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Section 2 — Hazard Identification&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;EU harmonised classification applies by mandate; HazCom 2024 classification criteria differ&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H-statements and P-statements&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Text changes between GHS revisions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 3 — Composition/Ingredient Information&lt;/td&gt;
&lt;td&gt;Verify&lt;/td&gt;
&lt;td&gt;EU requires REACH registration number and EC number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 8 — OEL Values&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;OSHA PEL, ACGIH TLV, NIOSH REL, and EU OELVs are separate values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 15 — Regulatory Information&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Translating "Industrial Safety and Health Act (労働安全衛生法)" yields a law name that does not exist in EU or US regulatory frameworks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 1 — Emergency Contact&lt;/td&gt;
&lt;td&gt;Verify&lt;/td&gt;
&lt;td&gt;EU: UFI required. US: local emergency contact number required&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sections 4–7, 9–12&lt;/td&gt;
&lt;td&gt;Mostly yes&lt;/td&gt;
&lt;td&gt;Physical data and procedures are largely transferable, though Section 11 must align with Section 2&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Some typical problems when applying a straight translation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Action&lt;/th&gt;
&lt;th&gt;What appears in the SDS&lt;/th&gt;
&lt;th&gt;Problem&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Translate Section 15 "労働安全衛生法" to English&lt;/td&gt;
&lt;td&gt;Industrial Safety and Health Act&lt;/td&gt;
&lt;td&gt;This law does not exist in EU or US regulatory frameworks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Translate H302 "飲み込むと有害" to English&lt;/td&gt;
&lt;td&gt;Harmful if swallowed&lt;/td&gt;
&lt;td&gt;HazCom 2024 has revised the text of some H-statements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Carry over Japanese control concentrations (管理濃度, mandatory administrative exposure values under Japan's Industrial Safety and Health Act) as-is&lt;/td&gt;
&lt;td&gt;50 ppm (Japan 管理濃度)&lt;/td&gt;
&lt;td&gt;OSHA PEL, ACGIH TLV, and NIOSH REL are separate values&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Translate "Skin irritation: not classified" to English&lt;/td&gt;
&lt;td&gt;Not classified&lt;/td&gt;
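&lt;td&gt;A Japanese "not classified" result does not carry over automatically: EU CLP has no Skin Irritation Category 3, but any Annex VI harmonised classification for the substance must be applied&lt;/td&gt;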
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Omit REACH information&lt;/td&gt;
&lt;td&gt;(no entry)&lt;/td&gt;
&lt;td&gt;EU requires REACH registration number and UFI&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Do not translate H-statement and P-statement text&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Translating the Japanese text of H302 ("飲み込むと有害") into English may not match the official HazCom 2024 wording. Look up the H- or P-statement number and use the official text from the destination country or region. Because HazCom 2024 revised some H-statements and P-statements, the GHS Revision 3 text used in HazCom 2012 SDSs cannot simply be carried forward.&lt;/p&gt;
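&lt;p&gt;One way to enforce this in tooling is to key the statement text on the destination and the statement number, and to fail loudly when no official text is registered. The sketch below is only an illustration: the table name, the function name, and the two-letter destination keys are hypothetical, and only H302 is filled in.&lt;/p&gt;

```python
# Hypothetical lookup table: (destination, statement code) to official wording.
# Only H302 is filled in here; a real table would be loaded from the text
# published by the destination authority, not typed in by hand.
OFFICIAL_STATEMENTS = {
    ("US", "H302"): "Harmful if swallowed",
    ("EU", "H302"): "Harmful if swallowed",
}


def official_statement(destination, code):
    """Return the official wording for a statement code in one destination.

    Raises KeyError instead of falling back to a translation, so a missing
    entry is caught before the SDS is issued.
    """
    key = (destination, code)
    if key not in OFFICIAL_STATEMENTS:
        raise KeyError(f"no official text registered for {code} in {destination}")
    return OFFICIAL_STATEMENTS[key]


print(official_statement("US", "H302"))
```

&lt;p&gt;Raising an error rather than returning a translated fallback is a deliberate choice in this sketch: an unregistered statement number then stops the pipeline instead of silently producing non-official wording.&lt;/p&gt;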

&lt;p&gt;&lt;strong&gt;Section 15 requires a complete rewrite, not a translation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;"Industrial Safety and Health Act" — the literal English rendering of 労働安全衛生法 — does not correspond to any EU or US statute. For EU SDSs, list REACH Regulation, CLP Regulation, and the implementing legislation of the destination member state. For US SDSs, list the OSHA Hazard Communication Standard (29 CFR 1910.1200), TSCA, and SARA Title III.&lt;/p&gt;

&lt;h2&gt;
  
  
  Section 8: OEL Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;th&gt;Reference Standard&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Legal Force&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japan&lt;/td&gt;
&lt;td&gt;Control concentrations (管理濃度, mandatory administrative exposure values, appended to Industrial Safety and Health Ordinance); JSOH (Japan Society for Occupational Health) recommended values&lt;/td&gt;
&lt;td&gt;TWA equivalent&lt;/td&gt;
&lt;td&gt;Control concentrations are mandatory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EU&lt;/td&gt;
&lt;td&gt;EU Occupational Exposure Limit Values (OELVs); national OELs of individual member states&lt;/td&gt;
&lt;td&gt;TWA, STEL&lt;/td&gt;
&lt;td&gt;Most EU OELVs are indicative (non-binding health-based values); Binding OELVs (BOELVs) are mandatory minima. Member states must establish national OELs at least as strict as BOELVs.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;OSHA PEL (29 CFR 1910 Subpart Z)&lt;/td&gt;
&lt;td&gt;TWA&lt;/td&gt;
&lt;td&gt;Mandatory (many values date to 1971 and have rarely been updated)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;ACGIH TLV (American Conference of Governmental Industrial Hygienists)&lt;/td&gt;
&lt;td&gt;TLV-TWA, TLV-STEL&lt;/td&gt;
&lt;td&gt;Not legally binding (widely used as the practical standard)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;US&lt;/td&gt;
&lt;td&gt;NIOSH REL (National Institute for Occupational Safety and Health)&lt;/td&gt;
&lt;td&gt;TWA, STEL&lt;/td&gt;
&lt;td&gt;Not legally binding&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;US SDSs in practice list all three (OSHA PEL, ACGIH TLV, and NIOSH REL) together. Many OSHA PELs date to 1971 and are less protective than current health risk assessments would support, which is why the non-binding TLV and REL values are listed alongside them. Japanese control concentrations and JSOH recommended values cannot be substituted for any of these.&lt;/p&gt;

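&lt;p&gt;For EU SDSs, confirm both the EU OELVs and the national OELs of the destination member state. Binding values (BOELVs) set the floor, and member states may set stricter limits; indicative values are non-binding, so the national OEL is the figure that applies in practice.&lt;/p&gt;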

&lt;h2&gt;
  
  
  EU-Specific Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;REACH Registration Number&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Substances registered with ECHA under REACH are assigned a registration number (format: 01-XXXXXXXXXX-XX-XXXX), which goes in Section 3. Japanese chemical manufacturers cannot register directly under REACH, but the number obtained by the EU importer or Only Representative must be confirmed and included in the SDS.&lt;/p&gt;

&lt;p&gt;An SDS without a confirmed registration number prevents EU downstream users from fulfilling their regulatory obligations.&lt;/p&gt;
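&lt;p&gt;A format check can catch obvious transcription errors before the SDS goes out. The following is a minimal sketch that assumes each X in the 01-XXXXXXXXXX-XX-XXXX shape is a digit; the function and constant names are hypothetical.&lt;/p&gt;

```python
import re

# Shape given in the text: 01-XXXXXXXXXX-XX-XXXX, with X assumed to be a digit.
REACH_REGISTRATION_PATTERN = re.compile(r"01-\d{10}-\d{2}-\d{4}")


def looks_like_reach_registration_number(value):
    """True when value has the 01-XXXXXXXXXX-XX-XXXX shape.

    This is a format check only. It cannot tell whether ECHA actually
    issued the number or whether it belongs to this supply chain.
    """
    return REACH_REGISTRATION_PATTERN.fullmatch(value.strip()) is not None


# Placeholder values, not real registration numbers.
print(looks_like_reach_registration_number("01-0123456789-01-0001"))
print(looks_like_reach_registration_number("01-12345-01-0001"))
```

&lt;p&gt;Passing this check says nothing about validity; the number still has to be confirmed with the EU importer or Only Representative.&lt;/p&gt;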

&lt;p&gt;&lt;strong&gt;UFI (Unique Formula Identifier)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The UFI links a mixture's composition to the notification submitted to the Poison Centre. Under Regulation (EU) 2017/542, UFIs are required — phased in from 2021 — on SDSs and labels for hazardous mixtures. UFIs are generated using the ECHA tool. Japanese SDS formats have no equivalent field.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extended SDS (eSDS)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a substance requires a Chemical Safety Assessment (CSA) under REACH, an Exposure Scenario must be attached to the SDS — producing an extended SDS (eSDS). The receiving party must review the attached scenarios to confirm their conditions of use fall within the documented scope.&lt;/p&gt;

&lt;h2&gt;
  
  
  US-Specific Requirements
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;California Proposition 65&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;California's Safe Drinking Water and Toxic Enforcement Act of 1986 (Prop 65) requires warning labels for products containing listed carcinogens or reproductive toxins above specified thresholds. For products sold in California, note applicable Prop 65 substances in Section 15 and add the required warning on product labels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SARA Title III&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Superfund Amendments and Reauthorization Act (SARA) Title III requires facilities to report storage and releases of certain hazardous chemicals to the Local Emergency Planning Committee (LEPC). Section 15 should note the applicable sections: Section 302 (EHS list), Section 304, Section 311/312, and Section 313 (TRI reporting).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TSCA&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Section 15 should indicate whether each substance appears on the TSCA Section 8(b) Chemical Substance Inventory.&lt;/p&gt;

&lt;h2&gt;
  
  
  Localization Requirements by Section
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Sections that always require localization&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Changes Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Section 2 — Hazard Identification&lt;/td&gt;
&lt;td&gt;Apply CLP harmonised classification (EU); apply HazCom 2024 classification criteria (US)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 3 — Composition/Ingredient Information&lt;/td&gt;
&lt;td&gt;REACH registration number and EC number (EU); confirm TSCA inventory listing (US)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 8 — Exposure Controls/Personal Protection&lt;/td&gt;
&lt;td&gt;Replace with EU OELVs and destination member state OELs (EU); list OSHA PEL/TLV/NIOSH REL (US)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 15 — Regulatory Information&lt;/td&gt;
&lt;td&gt;Rewrite for REACH/CLP (EU) or OSHA HazCom/TSCA/SARA (US)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Sections frequently requiring changes&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Section&lt;/th&gt;
&lt;th&gt;Changes Required&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Section 1 — Identification&lt;/td&gt;
&lt;td&gt;UFI (EU); local emergency contact number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 13 — Disposal Considerations&lt;/td&gt;
&lt;td&gt;Local legislation: EU Waste Framework Directive (WFD), US RCRA, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 14 — Transport Information&lt;/td&gt;
&lt;td&gt;UN numbers are shared, but DOT (US) and ADR (EU road) have different notation requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Checklists
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Checklist: Japanese SDS → EU SDS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Section 1: Is a UFI included? (required for hazardous mixtures)&lt;/li&gt;
&lt;li&gt;Section 1: Is the emergency contact number valid in the destination member state?&lt;/li&gt;
&lt;li&gt;Section 2: Where Annex VI harmonised classification exists, is it applied?&lt;/li&gt;
&lt;li&gt;Section 2: Has the substance been checked against the new hazard classes introduced by Delegated Regulation (EU) 2023/707 (endocrine disruption, PBT/PMT) and against the CLP revision in Regulation (EU) 2024/2865?&lt;/li&gt;
&lt;li&gt;Section 3: Is the REACH registration number included?&lt;/li&gt;
&lt;li&gt;Section 8: Are OEL values based on EU OELVs and the national OELs of the destination member state?&lt;/li&gt;
&lt;li&gt;Section 15: Are REACH Regulation, CLP Regulation, and the destination member state's implementing legislation listed?&lt;/li&gt;
&lt;li&gt;Language: Is the SDS in the official language of the destination member state?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Checklist: Japanese SDS → US SDS
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Section 2: Is the classification done under HazCom (2012 or 2024) criteria?&lt;/li&gt;
&lt;li&gt;Section 8: Are all three standards — OSHA PEL, ACGIH TLV, and NIOSH REL — listed?&lt;/li&gt;
&lt;li&gt;Section 15: Are OSHA Hazard Communication Standard (29 CFR 1910.1200), TSCA, and SARA Title III (where applicable) listed?&lt;/li&gt;
&lt;li&gt;Section 15: For products sold in California, are applicable Prop 65 substances noted?&lt;/li&gt;
&lt;li&gt;Language: Is the SDS in English?&lt;/li&gt;
&lt;li&gt;HazCom 2024 transition: Have the current OSHA compliance deadlines been confirmed?&lt;/li&gt;
&lt;/ul&gt;
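&lt;p&gt;The two checklists can also be expressed as data and run against a parsed SDS record. This is a rough sketch under stated assumptions: the field names are hypothetical, the record is a plain dictionary, and the check only tests for presence, not correctness.&lt;/p&gt;

```python
# Hypothetical field names for a parsed SDS record; the required items per
# destination follow the checklists above. The UFI applies to hazardous
# mixtures only, so a real check would make that entry conditional.
REQUIRED_FIELDS = {
    "EU": [
        "ufi",
        "emergency_contact",
        "harmonised_classification_checked",
        "reach_registration_number",
        "eu_and_national_oels",
        "eu_regulatory_references",
    ],
    "US": [
        "hazcom_classification",
        "osha_pel",
        "acgih_tlv",
        "niosh_rel",
        "us_regulatory_references",
    ],
}


def missing_fields(sds, destination):
    """List required fields that are absent or empty for one destination."""
    return [name for name in REQUIRED_FIELDS[destination] if not sds.get(name)]


# A translated Japanese SDS that carries over only the emergency contact.
translated = {"emergency_contact": "placeholder"}
print(missing_fields(translated, "EU"))
print(missing_fields(translated, "US"))
```

&lt;p&gt;An empty result only means every field is filled in; whether the values are right for the destination still needs review by someone who knows the local rules.&lt;/p&gt;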

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;"GHS-based" is the common thread, but classification systems, mandatory content, applicable law, and update obligations differ across regions.&lt;/p&gt;

&lt;p&gt;As of May 2026:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;EU:&lt;/strong&gt; CLP has its own classification system. Delegated Regulation (EU) 2023/707 added new hazard classes including endocrine disruption, and Regulation (EU) 2024/2865 later revised CLP itself. Mixtures first entering the EU market from May 2026 must comply with the new hazard classes. The main differences from Japanese SDSs are the REACH registration number and UFI requirements, and the mandatory application of harmonised classification.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;US:&lt;/strong&gt; Transition to HazCom 2024 (GHS Revision 7 base) is ongoing. OSHA extended the compliance deadlines in January 2026 — check the OSHA website for current dates. The practice of listing OSHA PEL, ACGIH TLV, and NIOSH REL together in Section 8 remains unchanged.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Japan:&lt;/strong&gt; JIS Z 7253:2025 (aligned to GHS Revision 9) is due for mandatory enforcement by 2030, with a periodic review obligation in effect since 2023.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A translated Japanese SDS is missing the REACH registration number, UFI, and harmonised classification for EU use — and the OSHA-required OEL entries and SARA/TSCA information for US use. Working from a Japanese source, the localization work typically exceeds the translation work.&lt;/p&gt;

</description>
      <category>chemistry</category>
      <category>cheminformatics</category>
      <category>regulatory</category>
      <category>ghs</category>
    </item>
    <item>
      <title>Cheminformatics Databases in 2026: PubChem, ChEMBL, Regulatory Inventories, and API Access</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Wed, 27 May 2026 12:42:44 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/cheminformatics-databases-in-2026-pubchem-chembl-regulatory-inventories-and-api-access-4e98</link>
      <guid>https://dev.to/kent-tokyo/cheminformatics-databases-in-2026-pubchem-chembl-regulatory-inventories-and-api-access-4e98</guid>
      <description>&lt;p&gt;When you try to pull chemical structure or bioactivity data via API, every database has its own endpoint design, its own license terms, and its own coverage. This post covers research DBs, national regulatory inventories, and Snowflake as of May 2026.&lt;/p&gt;




&lt;h2&gt;
  
  
  Research DB Overview
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DB&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Bulk Download&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PubChem&lt;/td&gt;
&lt;td&gt;Compounds, bioactivity, toxicity, patents&lt;/td&gt;
&lt;td&gt;REST&lt;/td&gt;
&lt;td&gt;✓ (FTP / PUG Download)&lt;/td&gt;
&lt;td&gt;Public domain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChEMBL&lt;/td&gt;
&lt;td&gt;Drug bioactivity&lt;/td&gt;
&lt;td&gt;REST + Python SDK&lt;/td&gt;
&lt;td&gt;✓ (TSV / SDF / SQL, FTP)&lt;/td&gt;
&lt;td&gt;CC BY-SA 3.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RCSB PDB&lt;/td&gt;
&lt;td&gt;Protein 3D structures&lt;/td&gt;
&lt;td&gt;GraphQL + REST&lt;/td&gt;
&lt;td&gt;✓ (FTP / rsync)&lt;/td&gt;
&lt;td&gt;CC0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DrugBank&lt;/td&gt;
&lt;td&gt;Drug info, DDI&lt;/td&gt;
&lt;td&gt;REST (registration required)&lt;/td&gt;
&lt;td&gt;Suspended&lt;/td&gt;
&lt;td&gt;CC BY-NC 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ZINC&lt;/td&gt;
&lt;td&gt;Purchasable compounds, 3D&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;✓ (SDF / SMILES, FTP)&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;BindingDB&lt;/td&gt;
&lt;td&gt;Binding affinity (Ki, IC50, etc.)&lt;/td&gt;
&lt;td&gt;Limited REST&lt;/td&gt;
&lt;td&gt;✓ (TSV, Excel-readable)&lt;/td&gt;
&lt;td&gt;CC BY 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UniChem&lt;/td&gt;
&lt;td&gt;Cross-DB ID translation&lt;/td&gt;
&lt;td&gt;REST&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  PubChem: The First Stop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://pubchem.ncbi.nlm.nih.gov/" rel="noopener noreferrer"&gt;PubChem&lt;/a&gt; (NCBI) holds 119 million compounds and 295 million bioactivity records (&lt;a href="https://academic.oup.com/nar/article/53/D1/D1516/7903365" rel="noopener noreferrer"&gt;NAR 2025&lt;/a&gt;), aggregating from over 1,000 sources including ChEMBL and DrugBank. Public domain.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Compound name → CID&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/aspirin/cids/JSON"&lt;/span&gt;
&lt;span class="c"&gt;# CID → SMILES&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2519/property/IsomericSMILES/JSON"&lt;/span&gt;
&lt;span class="c"&gt;# InChIKey → CID&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N/cids/JSON"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rate limit: 5 requests/sec, 400 requests/min. For tens of thousands of records or more, use PUG Download (up to 250,000 records per request) or FTP.&lt;/p&gt;
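&lt;p&gt;For scripted lookups, spacing requests on the client side keeps a loop under the per-second limit. Below is a minimal sketch, not an official client: the &lt;code&gt;Throttle&lt;/code&gt; class is hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without waiting.&lt;/p&gt;

```python
import time


class Throttle:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=5, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until the next call is allowed, then record the call time."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            delay = max(0.0, remaining)
            if delay:
                self._sleep(delay)
                now = self._clock()
        self._last_call = now


throttle = Throttle(max_per_second=5)
for name in ["aspirin", "caffeine"]:
    throttle.wait()  # call this immediately before each HTTP request
    print("ready to request", name)
```

&lt;p&gt;Even spacing at 5 requests/sec works out to at most 300 requests/min, which also stays inside the per-minute limit.&lt;/p&gt;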

&lt;p&gt;PubChem aggregates from many sources, so conflicting values for the same compound are common. For IC50 data specifically, ChEMBL's curated records are more reliable.&lt;/p&gt;




&lt;h2&gt;
  
  
  ChEMBL: The Standard for ML Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.ebi.ac.uk/chembl/" rel="noopener noreferrer"&gt;ChEMBL&lt;/a&gt; (EMBL-EBI) is a manually curated bioactivity database extracted from medicinal chemistry literature. It includes unit normalization and activity classification for IC50, Ki, and EC50 values, and has become the standard training data source for ML models.&lt;/p&gt;

&lt;p&gt;As of ChEMBL 36 (September 2025): 2.878 million compounds, 24.27 million activity records.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chembl_webresource_client.new_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;

&lt;span class="n"&gt;molecule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;molecule&lt;/span&gt;
&lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;molecule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CHEMBL25&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;molecule_structures&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;canonical_smiles&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;activity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;activity&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;activity&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target_chembl_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CHEMBL205&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;standard_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IC50&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;molecule_chembl_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;standard_value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;standard_units&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full DB downloads are available via FTP in SQLite, MySQL, and PostgreSQL formats. &lt;code&gt;pip install chembl-downloader&lt;/code&gt; handles reproducible retrieval.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;CC BY-SA 3.0. Derivatives must be released under the same license.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  RCSB PDB: Protein 3D Structures
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.rcsb.org/" rel="noopener noreferrer"&gt;RCSB PDB&lt;/a&gt; holds 254,000 experimentally determined structures (204,000 X-ray, 34,000 cryo-EM, 14,000 NMR). Fully open (CC0), with complete bulk download via FTP.&lt;/p&gt;

&lt;p&gt;The GraphQL API lets you fetch structure info, ligands, and citations in a single request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;{ entry(entry_id: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1ATP&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;) {
  struct { title }
  rcsb_entry_info { resolution_combined }
} }&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://data.rcsb.org/graphql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;query&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;struct&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Cryo-EM structures have grown past 34,000 over the past few years, with large complexes and membrane proteins increasingly represented. The AlphaFold 3 source code and weights were released for non-commercial use between November 2024 and February 2025, enabling complex structure prediction for proteins, DNA, RNA, and small molecule ligands. Predicted structures from AlphaFold DB (4.5 million users) are also accessible via RCSB, alongside the experimental entries.&lt;/p&gt;




&lt;h2&gt;
  
  
  DrugBank, ZINC, BindingDB
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;DrugBank&lt;/strong&gt; covers 11,891 drug entries — 4,563 approved, 6,231 investigational — plus 1.41 million drug-drug interactions (DrugBank 6.0, 2024). Indications, mechanisms, metabolism, and DDI data are all included, making it the typical starting point for drug repurposing research.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;⚠️ As of May 2026, academic dataset downloads are suspended (distribution method update in progress). API access continues.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;ZINC22&lt;/strong&gt; is a virtual screening resource: approximately 55 billion 2D compounds and 5.9 billion 3D docking-ready compounds (official figures). No REST API — access is via FTP or the web GUI. Primarily used with UCSF DOCK and AutoDock.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BindingDB&lt;/strong&gt; has 3.2 million protein–small molecule binding affinity records (Ki, IC50, Kd, etc.). The TSV download opens directly in Excel and shows up often in DTI benchmark sets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;UniChem&lt;/strong&gt; handles ID translation only — ChEMBL ID, PubChem CID, InChIKey, DrugBank ID, and more. Any multi-DB pipeline will need it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# UniChem 2.0 API (POST form recommended; v1 GET is legacy-compatible)&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="s2"&gt;"https://www.ebi.ac.uk/unichem/api/v1/compounds"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"type":"inchikey","compound":"BSYNRYMUTXBXSQ-UHFFFAOYSA-N"}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Japanese Databases (Free Only)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DB&lt;/th&gt;
&lt;th&gt;Contents&lt;/th&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;Bulk DL&lt;/th&gt;
&lt;th&gt;License&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;KEGG COMPOUND&lt;/td&gt;
&lt;td&gt;Metabolites, drugs, pathway integration&lt;/td&gt;
&lt;td&gt;REST (free for academic)&lt;/td&gt;
&lt;td&gt;Text format&lt;/td&gt;
&lt;td&gt;Academic free / commercial paid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SDBS (AIST)&lt;/td&gt;
&lt;td&gt;NMR/IR/MS spectra&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Not available (50/day limit)&lt;/td&gt;
&lt;td&gt;Non-commercial free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NITE-CHRIP&lt;/td&gt;
&lt;td&gt;Japanese regulatory info (CSCL, ISHL, etc.)&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Partial list&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Nikkaji RDF (NBDC)&lt;/td&gt;
&lt;td&gt;3.6M compounds from Nikkaji&lt;/td&gt;
&lt;td&gt;Bulk DL (SPARQL)&lt;/td&gt;
&lt;td&gt;RDF/TTL only&lt;/td&gt;
&lt;td&gt;CC BY 4.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;KEGG COMPOUND&lt;/strong&gt; integrates 19,572 pathway-registered compounds and 12,826 drugs with genomic and disease data. The REST API supports compound lookup, pathway cross-referencing, and BRITE hierarchy traversal. Commercial use requires a license via Pathway Solutions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Fetch compound by C number&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://rest.kegg.jp/get/C00031"&lt;/span&gt;
&lt;span class="c"&gt;# Convert KEGG compound IDs to ChEBI IDs&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://rest.kegg.jp/conv/chebi/compound"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;SDBS&lt;/strong&gt; (AIST) covers approximately 34,600 compounds with manually curated spectral data (FT-IR, EI-MS, ¹H NMR, ¹³C NMR). High reliability due to expert curation, but no API and a 50-spectra-per-day download limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NITE-CHRIP&lt;/strong&gt; provides cross-searchable regulatory information for approximately 300,000 substances under Japanese law (CSCL, ISHL, Poisonous and Deleterious Substances Control Act, etc.). Updated every two months. GHS classification lists are partially downloadable; the Excel format is sold separately by CIRS Group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nikkaji RDF&lt;/strong&gt; (NBDC) is the RDF release of the Japanese Chemical Substance Dictionary (Nikkaji). It contains over 3.6 million compounds under CC BY 4.0, which makes it the only Japanese DB in this comparison that allows commercial bulk download. Access is through SPARQL.&lt;/p&gt;




&lt;h2&gt;
  
  
  Regulatory Databases by Country
&lt;/h2&gt;

&lt;p&gt;Chemical substance inventories and regulatory DBs from various jurisdictions.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DB&lt;/th&gt;
&lt;th&gt;Region&lt;/th&gt;
&lt;th&gt;Coverage&lt;/th&gt;
&lt;th&gt;API&lt;/th&gt;
&lt;th&gt;CSV/DL&lt;/th&gt;
&lt;th&gt;English&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ECHA CHEM&lt;/td&gt;
&lt;td&gt;EU&lt;/td&gt;
&lt;td&gt;C&amp;amp;L: 350K substances&lt;/td&gt;
&lt;td&gt;Not available&lt;/td&gt;
&lt;td&gt;✓ (Open Data Portal)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;eChemPortal&lt;/td&gt;
&lt;td&gt;OECD&lt;/td&gt;
&lt;td&gt;Multi-country&lt;/td&gt;
&lt;td&gt;Unconfirmed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EFSA OpenFoodTox&lt;/td&gt;
&lt;td&gt;EU&lt;/td&gt;
&lt;td&gt;7,880 substances&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;✓ (Zenodo)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NCIS&lt;/td&gt;
&lt;td&gt;Korea&lt;/td&gt;
&lt;td&gt;Full KECL&lt;/td&gt;
&lt;td&gt;Unconfirmed&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCISS&lt;/td&gt;
&lt;td&gt;China (3rd-party)&lt;/td&gt;
&lt;td&gt;IECSC 47K substances&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DSL&lt;/td&gt;
&lt;td&gt;Canada&lt;/td&gt;
&lt;td&gt;28K substances&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;✓ (CSV/XLSX)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIIC&lt;/td&gt;
&lt;td&gt;Australia&lt;/td&gt;
&lt;td&gt;40K substances&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;✓ (spreadsheet)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NITE-CHRIP&lt;/td&gt;
&lt;td&gt;Japan&lt;/td&gt;
&lt;td&gt;300K substances&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Partial&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  EU: ECHA CHEM
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://chem.echa.europa.eu/" rel="noopener noreferrer"&gt;ECHA CHEM&lt;/a&gt; is run by the European Chemicals Agency and is the largest public chemical DB. It covers the C&amp;amp;L Inventory (4,400+ harmonized classifications, 7 million+ industry self-classifications) and REACH registration data (physicochemical, toxicological, and ecotoxicological test results). Relaunched in January 2024, with the C&amp;amp;L module integrated in May 2025. Bulk data is available from the ECHA Open Data Portal.&lt;/p&gt;

&lt;p&gt;An official REST API is not yet available (announced as a future feature).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.echemportal.org/" rel="noopener noreferrer"&gt;eChemPortal&lt;/a&gt; (OECD) provides a single search interface across regulatory DBs from ECHA REACH, the US EPA, Japan's NITE-CHRIP, and others. It covers 27,000 REACH-registered substances and over 1.3 million endpoint records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.efsa.europa.eu/en/data-report/chemical-hazards-database-openfoodtox" rel="noopener noreferrer"&gt;EFSA OpenFoodTox&lt;/a&gt; is a toxicological evaluation dataset for 7,880 food-relevant substances (food additives, pesticides, contaminants, etc.). Accessible via the Open EFSA API and downloadable through Zenodo.&lt;/p&gt;

&lt;h3&gt;
  
  
  China: CCISS / IECSC
&lt;/h3&gt;

&lt;p&gt;China's chemical inventory is the IECSC (现有化学物质名录), maintained by MEE (Ministry of Ecology and Environment). It covers 47,000 substances, but the official site is Chinese-only.&lt;/p&gt;

&lt;p&gt;The practical entry point is &lt;a href="https://cciss.cirs-group.com/" rel="noopener noreferrer"&gt;CCISS&lt;/a&gt; (a free tool by CIRS Group), which provides an English-language interface for searching IECSC by CAS number or English name. It is not official, but update frequency and accuracy are reliable. For anyone who cannot read Chinese, CCISS is effectively the only access point.&lt;/p&gt;

&lt;h3&gt;
  
  
  Korea: NCIS
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://ncis.nier.go.kr/en/main.do" rel="noopener noreferrer"&gt;NCIS&lt;/a&gt; (National Chemical Information System) is the official DB run by the National Institute of Environmental Research (NIER). It covers KECL (Korea Existing Chemicals List) for K-REACH compliance, hazardous chemical lists, and GHS classification data. It is one of the few Asian government chemical DBs with an English interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  Canada and Australia
&lt;/h3&gt;

&lt;p&gt;Canada's &lt;a href="https://pollution-waste.canada.ca/substances-search/" rel="noopener noreferrer"&gt;DSL&lt;/a&gt; (Domestic Substances List, 28,000 substances) is available as CSV/XLSX from the Open Government Portal.&lt;/p&gt;

&lt;p&gt;Australia's &lt;a href="https://services.industrialchemicals.gov.au/search-inventory/" rel="noopener noreferrer"&gt;AIIC&lt;/a&gt; (Australian Inventory of Industrial Chemicals, 40,000 substances), maintained by AICIS, is published in spreadsheet format twice a year. Neither the DSL nor the AIIC has an API; bulk download is the only programmatic option.&lt;/p&gt;




&lt;h2&gt;
  
  
  Snowflake Marketplace
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;There are no free chemical structure DB listings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Major chemical structure DBs such as PubChem and ChEMBL are not available as Data Shares on the Snowflake Marketplace. Commercial listings like &lt;a href="https://app.snowflake.com/marketplace/listings/IQVIA" rel="noopener noreferrer"&gt;IQVIA&lt;/a&gt; (clinical and prescription data) and &lt;a href="https://www.drugpatentwatch.com/blog/drugpatentwatch-now-available-on-the-snowflake-marketplace/" rel="noopener noreferrer"&gt;DrugPatentWatch&lt;/a&gt; (patent data) exist, but they do not contain molecular structure data.&lt;/p&gt;

&lt;p&gt;Snowflake works as a platform rather than a data source for chemistry:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RDKit can run inside a Snowpark Python UDF, which exposes fingerprint calculation and similarity search to SQL&lt;/li&gt;
&lt;li&gt;The standard workflow is to download ChEMBL and PubChem data over FTP and ingest it with Snowpark&lt;/li&gt;
&lt;li&gt;AWS terminated its S3 hosting of ChEMBL, so FTP is now the standard retrieval method
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RDKit via Snowpark Python UDF
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark.functions&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;udf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;snowflake.snowpark.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;

&lt;span class="nd"&gt;@udf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;return_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;input_types&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;canonical_smiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smiles&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;rdkit&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chem&lt;/span&gt;
    &lt;span class="n"&gt;mol&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Chem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MolFromSmiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;smiles&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;Chem&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MolToSmiles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mol&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;mol&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As of May 2026, there is no indication that either PubChem or ChEMBL is offered as a Snowflake Data Share.&lt;/p&gt;




&lt;h2&gt;
  
  
  CAS Number Search
&lt;/h2&gt;

&lt;p&gt;CAS numbers (e.g., 50-78-2 for aspirin) are the practical cross-DB identifier. API support varies by database.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;DB&lt;/th&gt;
&lt;th&gt;CAS Search&lt;/th&gt;
&lt;th&gt;Method&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;PubChem&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;REST API (treated as compound name)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChEMBL&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Python SDK (synonym filter)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;KEGG&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;REST API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UniChem&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;POST API (cross-reference)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ECHA CHEM&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Web UI only (no API)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NITE-CHRIP&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Web UI only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NCIS (Korea)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Web UI only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCISS (China)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Web UI only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DSL (Canada)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Web UI + CSV download&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIIC (Australia)&lt;/td&gt;
&lt;td&gt;✓&lt;/td&gt;
&lt;td&gt;Web UI + spreadsheet&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  PubChem
&lt;/h3&gt;

&lt;p&gt;CAS numbers can be passed directly as compound names.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CAS number → CID&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/cids/JSON"&lt;/span&gt;

&lt;span class="c"&gt;# CAS number → SMILES, molecular weight, and formula in one request&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/50-78-2/property/MolecularFormula,MolecularWeight,IsomericSMILES/JSON"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  ChEMBL
&lt;/h3&gt;

&lt;p&gt;CAS numbers are stored in ChEMBL's synonyms table.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chembl_webresource_client.new_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;

&lt;span class="n"&gt;molecule&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;molecule&lt;/span&gt;
&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;molecule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;molecule_synonyms__synonym&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50-78-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;molecule_chembl_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;molecule_structures&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;canonical_smiles&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  KEGG
&lt;/h3&gt;

&lt;p&gt;The KEGG REST API accepts CAS numbers directly in the find endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Search KEGG compounds by CAS number&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://rest.kegg.jp/find/compound/50-78-2"&lt;/span&gt;

&lt;span class="c"&gt;# Returns TSV like: C01405\t50-78-2&lt;/span&gt;
&lt;span class="c"&gt;# Then fetch details by C number&lt;/span&gt;
curl &lt;span class="s2"&gt;"https://rest.kegg.jp/get/C01405"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;find&lt;/code&gt; endpoint also accepts compound names, molecular formulas, and molecular weight ranges.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bulk CAS → SMILES via PubChem
&lt;/h3&gt;

&lt;p&gt;For pipeline use, the PUG REST POST endpoint handles batch conversion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;cas_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;50-78-2&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;64-17-5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;7732-18-5&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# aspirin, ethanol, water
&lt;/span&gt;
&lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/property/IsomericSMILES/JSON&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cas_list&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PropertyTable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Properties&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;CID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prop&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;IsomericSMILES&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;PubChem treats CAS numbers as chemical names. One CAS number can map to multiple CIDs (salts, hydrates, stereoisomers) — if you get multiple results, take the first CID or filter by InChIKey.&lt;/p&gt;
&lt;/blockquote&gt;
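&lt;p&gt;The "filter by InChIKey" step can be sketched as a small helper. It assumes the response has the same &lt;code&gt;PropertyTable&lt;/code&gt; / &lt;code&gt;Properties&lt;/code&gt; shape used in the loop above and that &lt;code&gt;InChIKey&lt;/code&gt; was added to the requested property list. The function name &lt;code&gt;pick_cid&lt;/code&gt; is hypothetical, not part of any PubChem client.&lt;/p&gt;

```python
from typing import Any, Dict, List, Optional


def pick_cid(payload: Dict[str, Any], inchikey: Optional[str] = None) -> Optional[int]:
    """Return one CID from a parsed PUG REST property response.

    With an InChIKey, return the first record whose key matches exactly;
    without one, fall back to the first record in the response.
    Returns None when nothing qualifies.
    """
    # Assumed layout: {"PropertyTable": {"Properties": [{"CID": ..., "InChIKey": ...}]}}
    records: List[Dict[str, Any]] = payload.get("PropertyTable", {}).get("Properties", [])
    if inchikey is not None:
        records = [r for r in records if r.get("InChIKey") == inchikey]
    return records[0]["CID"] if records else None
```

&lt;p&gt;Matching on the full InChIKey is strict: it will reject salts, hydrates, and other stereoisomers, which is usually what a pipeline wants when the CAS number is ambiguous.&lt;/p&gt;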




&lt;h2&gt;
  
  
  Data Quality
&lt;/h2&gt;

&lt;p&gt;A 2025 EPA paper flagged the propagation of incorrect CAS numbers and stereochemical information across multiple databases. PubChem aggregates from many sources, so conflicting values between them are common. For publications or regulatory submissions, check the primary source — literature or experimental data.&lt;/p&gt;
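&lt;p&gt;One cheap guard against mistyped CAS numbers is the check digit built into the format: the last digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. The sketch below only catches malformed numbers. It cannot tell whether a well-formed CAS number is attached to the wrong compound, which is the harder problem. The helper name &lt;code&gt;is_valid_cas&lt;/code&gt; is an assumption for illustration.&lt;/p&gt;

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^([0-9]{2,7})-([0-9]{2})-([0-9])$")


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a CAS Registry Number."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight 1 goes to the digit just before the check digit, then 2, 3, ...
    total = sum(weight * int(d) for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == expected


for cas in ["50-78-2", "64-17-5", "7732-18-5", "50-78-3"]:
    print(cas, is_valid_cas(cas))
```

&lt;p&gt;For aspirin (50-78-2) the sum is 8×1 + 7×2 + 0×3 + 5×4 = 42, and 42 mod 10 = 2, which matches the final digit.&lt;/p&gt;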




&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://academic.oup.com/nar/article/53/D1/D1516/7903365" rel="noopener noreferrer"&gt;PubChem 2025 update | Nucleic Acids Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://chembl.blogspot.com/2024/12/heres-nice-christmas-gift-chembl-35-is.html" rel="noopener noreferrer"&gt;ChEMBL 35 is out | ChEMBL Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.embl.org/news/updates-from-data-resources/chembl-36/" rel="noopener noreferrer"&gt;ChEMBL 36 is live | EMBL-EBI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9976280/" rel="noopener noreferrer"&gt;ZINC-22 | PMC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://academic.oup.com/nar/article/52/D1/D1265/7416367" rel="noopener noreferrer"&gt;DrugBank 6.0 | Nucleic Acids Research&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC11701568/" rel="noopener noreferrer"&gt;BindingDB in 2024 | PMC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.rcsb.org/news/feature/69405bd2bcb807eabf14c690" rel="noopener noreferrer"&gt;RCSB PDB 2025 milestone | RCSB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12807749/" rel="noopener noreferrer"&gt;AlphaFold Protein Structure Database 2025 | PMC&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.kegg.jp/kegg/rest/keggapi.html" rel="noopener noreferrer"&gt;KEGG API Manual&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.nite.go.jp/en/chem/index.html" rel="noopener noreferrer"&gt;NITE-CHRIP | NITE&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://chem.echa.europa.eu/" rel="noopener noreferrer"&gt;ECHA CHEM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.echemportal.org/" rel="noopener noreferrer"&gt;eChemPortal | OECD&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.efsa.europa.eu/en/data-report/chemical-hazards-database-openfoodtox" rel="noopener noreferrer"&gt;EFSA OpenFoodTox&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ncis.nier.go.kr/en/main.do" rel="noopener noreferrer"&gt;NCIS | NIER Korea&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://cciss.cirs-group.com/" rel="noopener noreferrer"&gt;CCISS | CIRS Group&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pollution-waste.canada.ca/substances-search/" rel="noopener noreferrer"&gt;DSL | Canada&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://services.industrialchemicals.gov.au/search-inventory/" rel="noopener noreferrer"&gt;AIIC | AICIS Australia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.drugpatentwatch.com/blog/drugpatentwatch-now-available-on-the-snowflake-marketplace/" rel="noopener noreferrer"&gt;DrugPatentWatch on Snowflake Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/snowflake/cheminformatics-in-snowflake-using-rdkit-snowpark-to-analyze-molecular-data-9136afb2b10f" rel="noopener noreferrer"&gt;Cheminformatics in Snowflake | Medium&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cheminformatics</category>
      <category>chemistry</category>
      <category>api</category>
      <category>bioinformatics</category>
    </item>
    <item>
      <title>Cheminformatics in Rust in 2025-2026: What Exists, What Doesn't, and Why</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Mon, 25 May 2026 13:07:26 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/cheminformatics-in-rust-in-2025-2026-what-exists-what-doesnt-and-why-e3h</link>
      <guid>https://dev.to/kent-tokyo/cheminformatics-in-rust-in-2025-2026-what-exists-what-doesnt-and-why-e3h</guid>
      <description>&lt;p&gt;RDKit has been the dominant cheminformatics library since its open-source release in 2006. It is written in C++, wrapped in Python, and has accumulated nearly two decades of validated chemistry: SMILES and SMARTS parsing, multiple fingerprint types, 2D coordinate generation, 3D conformer generation, MMFF94 and UFF force fields, a PostgreSQL cartridge. Most cheminformatics pipelines assume it is present.&lt;/p&gt;

&lt;p&gt;In mid-2026, Rust's answer is &lt;code&gt;rdkit-sys&lt;/code&gt; — bindings to RDKit's C++ CFFI interface — and a collection of pure-Rust crates that stalled in 2020-2021.&lt;/p&gt;

&lt;h2&gt;
  
  
  What exists in 2025-2026
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Crate&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Latest&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;rdkit-sys&lt;/td&gt;
&lt;td&gt;C++ FFI to RDKit&lt;/td&gt;
&lt;td&gt;0.4.12 (Oct 2024)&lt;/td&gt;
&lt;td&gt;Maintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;openbabel&lt;/td&gt;
&lt;td&gt;C++ FFI to Open Babel&lt;/td&gt;
&lt;td&gt;0.5.4 (Jan 2025)&lt;/td&gt;
&lt;td&gt;Maintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;chemcore&lt;/td&gt;
&lt;td&gt;Pure Rust&lt;/td&gt;
&lt;td&gt;0.4.1 (Feb 2021)&lt;/td&gt;
&lt;td&gt;Unmaintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;purr&lt;/td&gt;
&lt;td&gt;Pure Rust (SMILES parser)&lt;/td&gt;
&lt;td&gt;0.9.0 (Mar 2021)&lt;/td&gt;
&lt;td&gt;Unmaintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;smiles-parser&lt;/td&gt;
&lt;td&gt;Pure Rust (SMILES parser)&lt;/td&gt;
&lt;td&gt;0.4.1 (Nov 2020)&lt;/td&gt;
&lt;td&gt;Unmaintained&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;cosmolkit&lt;/td&gt;
&lt;td&gt;Pure Rust (new attempt)&lt;/td&gt;
&lt;td&gt;0.2.3 (May 2026)&lt;/td&gt;
&lt;td&gt;New, unproven&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The pattern among the pure-Rust crates is consistent: implementations hit a wall around 2020-2021 and stopped. The active work is FFI bindings to existing C++ tools. A new attempt (&lt;code&gt;cosmolkit&lt;/code&gt;) appeared recently with an ambitious scope (SMILES, SDF, conformers, molecular graphs), but with under 800 downloads it is too early to evaluate.&lt;/p&gt;

&lt;h2&gt;
  
  
  SMILES parsing is solved. The rest is not.
&lt;/h2&gt;

&lt;p&gt;Parsing a SMILES string is a context-free grammar problem, and Rust handles those well. &lt;code&gt;purr&lt;/code&gt; implements the full OpenSMILES specification. &lt;code&gt;smiles-parser&lt;/code&gt; does the same. Both work. Neither has had a release since 2020-2021.&lt;/p&gt;

&lt;p&gt;The problem starts after parsing.&lt;/p&gt;

&lt;p&gt;A SMILES string like &lt;code&gt;c1ccccc1&lt;/code&gt; (benzene) uses lowercase atoms to indicate aromaticity. To do anything useful — calculate molecular weight, count implicit hydrogens, check valence — you need to convert it to a Kekulé structure: alternating single and double bonds. This is kekulization, and it is a constraint-satisfaction problem on the molecular graph.&lt;/p&gt;
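&lt;p&gt;To make the constraint concrete, here is a minimal Python sketch that treats kekulization as a perfect-matching search. It assumes every aromatic atom needs exactly one double bond, which holds for benzene-like carbons but not for atoms such as a pyrrole-type nitrogen. It is a plain backtracking search for illustration, not the algorithm used by chemcore or RDKit, and the &lt;code&gt;kekulize&lt;/code&gt; name and data layout are hypothetical.&lt;/p&gt;

```python
from typing import Dict, FrozenSet, List, Optional, Set

Bond = FrozenSet[int]


def kekulize(aromatic_atoms: Set[int], bonds: List[Bond]) -> Optional[Set[Bond]]:
    """Pick double bonds so that every aromatic atom has exactly one.

    Returns the bonds chosen as double, or None if no assignment exists.
    Simplification: all aromatic atoms are assumed to need a double bond.
    """
    # Adjacency restricted to bonds between two aromatic atoms.
    neighbors: Dict[int, List[int]] = {atom: [] for atom in aromatic_atoms}
    for bond in bonds:
        a, b = tuple(bond)
        if a in aromatic_atoms and b in aromatic_atoms:
            neighbors[a].append(b)
            neighbors[b].append(a)

    def solve(unmatched: Set[int]) -> Optional[Set[Bond]]:
        if not unmatched:
            return set()
        atom = min(unmatched)  # deterministic choice of the next atom
        for other in sorted(neighbors[atom]):
            if other in unmatched:
                rest = solve(unmatched - {atom, other})
                if rest is not None:
                    return rest | {frozenset((atom, other))}
        return None  # dead end: this atom cannot get a double bond

    return solve(set(aromatic_atoms))


# Benzene (c1ccccc1): six aromatic carbons in a ring.
benzene_bonds = [frozenset((i, (i + 1) % 6)) for i in range(6)]
double_bonds = kekulize(set(range(6)), benzene_bonds)
print(sorted(sorted(b) for b in double_bonds))
```

&lt;p&gt;A five-membered ring of such atoms returns &lt;code&gt;None&lt;/code&gt;, because an odd number of atoms cannot all be paired. Real toolkits need extra rules here for heteroatoms and charges.&lt;/p&gt;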

&lt;p&gt;&lt;code&gt;chemcore&lt;/code&gt;, the most complete pure-Rust attempt, has supported kekulization since its initial release (v0.1.x, June 2020). A benchmark published alongside v0.3.1 in October 2020 showed it handling edge cases that RDKit cannot. But kekulization is one step. What chemcore does not have: fingerprints, 2D coordinate generation, SMARTS matching, or stereochemistry. The last release was February 2021. Getting past kekulization turned out not to be the finish line.&lt;/p&gt;

&lt;h2&gt;
  
  
  Aromaticity: no agreed definition
&lt;/h2&gt;

&lt;p&gt;Even with kekulization in place, aromaticity perception is harder than it looks — partly because aromaticity itself has no single agreed-upon definition in cheminformatics.&lt;/p&gt;

&lt;p&gt;Hückel's rule — 4n+2 π electrons — works for monocyclic systems. For polycyclic aromatics and heteroaromatics, implementations diverge. Daylight's original SMILES aromatic model differs from RDKit's model, which differs from CDK's. An algorithm that kekulizes correctly under one model may fail under another.&lt;/p&gt;
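&lt;p&gt;As a reference point, the monocyclic rule itself is a one-line check. The sketch below tests only the electron count for a single ring. It does not cover planarity, fused systems, or the model-specific decisions about which atoms contribute electrons, which is where the toolkits actually diverge. The function name is an assumption for illustration.&lt;/p&gt;

```python
def satisfies_huckel(pi_electrons: int) -> bool:
    """Return True when pi_electrons equals 4n + 2 for some integer n >= 0."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


examples = {
    "benzene": 6,
    "cyclobutadiene": 4,
    "cyclopentadienyl anion": 6,
    "cyclooctatetraene": 8,
}
for name, count in examples.items():
    print(name, count, satisfies_huckel(count))
```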

&lt;p&gt;Any pure-Rust toolkit that wants to produce output compatible with RDKit-generated SMILES needs to match RDKit's aromaticity behavior exactly, not implement some variant of Hückel. That requires reading RDKit's source code and testing against its outputs. It is months of work before any of it is visible to end users.&lt;/p&gt;

&lt;h2&gt;
  
  
  2D coordinate generation: not attempted
&lt;/h2&gt;

&lt;p&gt;Every cheminformatics toolkit ships 2D depiction — you cannot work with molecules you cannot see. The layout problem is harder than it looks.&lt;/p&gt;

&lt;p&gt;RDKit ships its own 2D depiction engine (&lt;code&gt;rdDepictor&lt;/code&gt;) and also integrates Schrödinger's &lt;code&gt;CoordGen&lt;/code&gt; library because &lt;code&gt;rdDepictor&lt;/code&gt; alone produces clashing depictions for complex ring systems. Two tools are needed because neither is sufficient alone. CoordGen works by matching known ring scaffold templates and running iterative geometry optimization for everything else.&lt;/p&gt;

&lt;p&gt;No pure-Rust crate has attempted 2D coordinate generation. Getting it right requires ring perception, a library of scaffold templates, and an optimization pass to resolve clashes. It is a multi-month project, and the output is still wrong until enough templates are added.&lt;/p&gt;

&lt;h2&gt;
  
  
  Substructure search: the graph is not the chemistry
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;petgraph&lt;/code&gt; (v0.8.3, 377M total downloads) provides VF2-based subgraph isomorphism and is actively maintained. VF2 is the standard algorithm for this — roughly an order of magnitude faster than Ullmann on typical molecule-sized graphs. The graph infrastructure exists in Rust.&lt;/p&gt;

&lt;p&gt;SMARTS matching, which is how substructure search works in cheminformatics, requires more than graph isomorphism. A SMARTS pattern &lt;code&gt;[#6;r6]&lt;/code&gt; means "a carbon atom in a 6-membered ring." Matching it requires: parsing SMARTS syntax, knowing which atoms belong to which rings, and matching node attributes with chemical semantics — atomic number, formal charge, aromaticity flag, implicit hydrogen count.&lt;/p&gt;
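&lt;p&gt;The atom-level half of that work can be sketched as a predicate of the kind a subgraph isomorphism routine would call for each candidate node. The example is in Python for brevity and hand-codes the single expression &lt;code&gt;[#6;r6]&lt;/code&gt;, reading &lt;code&gt;r6&lt;/code&gt; as "smallest ring containing the atom has size 6". The &lt;code&gt;Atom&lt;/code&gt; type and its fields are hypothetical, and a real implementation would compile the predicate from parsed SMARTS and get ring sizes from a ring-perception pass.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Atom:
    atomic_num: int
    smallest_ring: Optional[int]  # None when the atom is not in any ring


def matches_c_in_6_ring(atom: Atom) -> bool:
    """Hand-written equivalent of the SMARTS atom expression [#6;r6]."""
    return atom.atomic_num == 6 and atom.smallest_ring == 6


benzene_carbon = Atom(atomic_num=6, smallest_ring=6)
cyclopentane_carbon = Atom(atomic_num=6, smallest_ring=5)
pyridine_nitrogen = Atom(atomic_num=7, smallest_ring=6)
methyl_carbon = Atom(atomic_num=6, smallest_ring=None)

for atom in (benzene_carbon, cyclopentane_carbon, pyridine_nitrogen, methyl_carbon):
    print(atom, matches_c_in_6_ring(atom))
```

&lt;p&gt;Only the first atom matches. The predicate is trivial once &lt;code&gt;smallest_ring&lt;/code&gt; is known; computing that field correctly for every atom is the chemistry-specific part a generic graph library does not supply.&lt;/p&gt;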

&lt;p&gt;Connecting &lt;code&gt;petgraph&lt;/code&gt;'s isomorphism to a chemistry-aware molecular graph is exactly the glue code that no published Rust crate provides.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why bindings are the rational choice
&lt;/h2&gt;

&lt;p&gt;RDKit's changelog goes back to 2006. The codebase contains 200+ molecular descriptors, MMFF94 and UFF force fields with their respective validation papers, an ETKDG 3D conformer generator that uses torsion angle statistics from the Cambridge Structural Database, and a PostgreSQL cartridge for large-scale screening. The Python ecosystem wraps all of this: &lt;code&gt;chembl_webresource_client&lt;/code&gt; for ChEMBL API access, PandasTools, scikit-learn integration for ML on fingerprints.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;rdkit-sys&lt;/code&gt; exposes a fraction of this via RDKit's CFFI interface. Choosing bindings over a rewrite is not a concession. It is what you do when you look at how much chemistry is embedded in that C++ code and how long it took to get there.&lt;/p&gt;

&lt;h2&gt;
  
  
  What changed in 2024-2025, and what 2026 adds so far
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;2024-2025:&lt;/strong&gt; &lt;code&gt;rdkit-sys&lt;/code&gt; had three releases in 2024, the last in October, and moved into the &lt;code&gt;rdkit-rs/rdkit&lt;/code&gt; monorepo. &lt;code&gt;openbabel&lt;/code&gt; (Rust bindings) released 0.5.4 in January 2025 — it exposes Open Babel's &lt;code&gt;OBSmartsPattern&lt;/code&gt;, which matters if you need substructure search without pulling in RDKit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2026:&lt;/strong&gt; The only 2026-specific addition is &lt;code&gt;cosmolkit&lt;/code&gt; (v0.2.3, May 2026, 778 downloads). It claims an ambitious scope — SMILES, SDF, conformers, molecular graphs, "AI-ready workflows" — but it is too new to evaluate. Whether it addresses aromaticity perception and 2D layout, the parts that stopped every earlier attempt, is not clear from the current documentation.&lt;/p&gt;

&lt;p&gt;As of this writing, nothing else has shipped in 2026. The structural gap between Rust and Python cheminformatics is the same as it was in 2025.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual hard part
&lt;/h2&gt;

&lt;p&gt;The challenging problems in cheminformatics are not Rust-specific. Ownership and lifetimes will slow you down on day one; aromaticity will block you on month three. The chemistry fundamentals — aromaticity perception, 2D layout, stereochemistry, substructure matching — require domain knowledge that does not come from a Rust tutorial.&lt;/p&gt;

&lt;p&gt;RDKit did not get where it is because C++ is better than Rust. It got there because a team of chemists and programmers spent two decades solving specific, hard chemistry problems. Whoever builds the Rust equivalent will need to solve the same problems.&lt;/p&gt;

&lt;p&gt;I have been working around these gaps while building &lt;a href="https://github.com/kent-tokyo/chem-wasm-lens" rel="noopener noreferrer"&gt;chem-wasm-lens&lt;/a&gt;, a pure-Rust molecular analysis library targeting the browser via WebAssembly. Restricting scope — no SMARTS, no full stereochemistry — made it possible to ship. But restricted scope is different from a general-purpose toolkit, and that distinction matters.&lt;/p&gt;

</description>
      <category>cheminformatics</category>
      <category>rust</category>
      <category>chemistry</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Rust in 2025-2026: From 'Most Loved Language' to Core Infrastructure</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Sun, 24 May 2026 04:38:04 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/rust-in-2025-2026-from-most-loved-language-to-core-infrastructure-4l5k</link>
      <guid>https://dev.to/kent-tokyo/rust-in-2025-2026-from-most-loved-language-to-core-infrastructure-4l5k</guid>
      <description>&lt;p&gt;Rust has held the top spot in Stack Overflow's "most admired language" survey since 2016, nearly without interruption. But what happened in 2025 is no longer just about popularity polls. Rust is quietly but steadily becoming the language that underpins critical infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Experimental" Status Removed from the Linux Kernel
&lt;/h2&gt;

&lt;p&gt;In December 2025, at the Kernel Maintainers Summit held in Tokyo, Rust's status in the Linux kernel was elevated from experimental to an officially recognized implementation language.&lt;/p&gt;

&lt;p&gt;As &lt;a href="https://lwn.net/Articles/1049831/" rel="noopener noreferrer"&gt;LWN.net reported&lt;/a&gt;, the outcome was unambiguous: "The consensus among the assembled developers is that Rust in the kernel is no longer experimental — it is now a core part of the kernel and is here to stay." Maintainer Steven Rostedt noted there was zero pushback in the room. About five years after Linus first suggested the possibility in 2020, the matter was settled.&lt;/p&gt;

&lt;p&gt;Rust code in the kernel currently stands at around 25,000 lines (compared to 34 million lines of C) — still a small share — but the subsystems adopting it are steadily expanding.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rust Code Running in Production
&lt;/h3&gt;

&lt;p&gt;Major components using Rust in the kernel today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;PHY drivers&lt;/strong&gt; — network physical layer&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;null block driver&lt;/strong&gt; — test block device&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Android Binder driver&lt;/strong&gt; — kernel-side implementation of Android's IPC mechanism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Apple AGX GPU driver&lt;/strong&gt; — for Apple Silicon, via the Asahi Linux project&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Nova GPU driver&lt;/strong&gt; — for NVIDIA Turing-generation hardware (RTX 20 / GTX 16 series)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Nova driver is architecturally interesting. It is split into two crates: &lt;code&gt;nova-core&lt;/code&gt; (hardware initialization and communication) and &lt;code&gt;nova-drm&lt;/code&gt; (Linux DRM API implementation), using an adapter pattern that maps different bus types (PCI, platform, USB) to a common set of types. As of early 2026, the driver is still under development and not yet fully enabled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Rust is Well-Suited for the Kernel
&lt;/h3&gt;

&lt;p&gt;The majority of kernel driver bugs stem from memory safety issues: NULL pointer dereferences, use-after-free, buffer overflows, data races. In C, the only mitigation is "write carefully." Rust eliminates these classes of bugs at compile time, by construction.&lt;/p&gt;

&lt;p&gt;Greg Kroah-Hartman has stated that Rust drivers are safer than their C counterparts, and this is exactly why. The 25,000-line figure is small, but it also means Rust is replacing the parts where "bugs would have been guaranteed if written in C."&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Story Behind &lt;code&gt;async closures&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Among the stabilizations in 2025, &lt;code&gt;async closures&lt;/code&gt; (Rust 1.85) were a deeper change than they appear.&lt;/p&gt;

&lt;h3&gt;
  
  
  What was wrong with the old workarounds?
&lt;/h3&gt;

&lt;p&gt;Previously, writing an "async closure" looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// The closure couldn't borrow captured variables inside the Future.&lt;/span&gt;
&lt;span class="c1"&gt;// You had to either move ownership in, or wrap with Arc&amp;lt;Mutex&amp;lt;T&amp;gt;&amp;gt;.&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="nf"&gt;.clone&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="c1"&gt;// clone required&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The problem with &lt;code&gt;|x| async move { ... }&lt;/code&gt; was that the returned &lt;code&gt;Future&lt;/code&gt; could not borrow captures from the closure itself. Because the &lt;code&gt;Future&lt;/code&gt; has a different lifetime from the closure, you had no way to pass references — you had to either move ownership or clone.&lt;/p&gt;

&lt;h3&gt;
  
  
  The &lt;code&gt;AsyncFn&lt;/code&gt; Trait Hierarchy
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;async closures&lt;/code&gt; introduced in Rust 1.85 come with a new trait hierarchy internally: &lt;code&gt;AsyncFn&lt;/code&gt; / &lt;code&gt;AsyncFnMut&lt;/code&gt; / &lt;code&gt;AsyncFnOnce&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;AsyncFnOnce
  └─ AsyncFnMut
       └─ AsyncFn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mirrors the existing &lt;code&gt;Fn*&lt;/code&gt; traits, but with a critical difference: the &lt;code&gt;Future&lt;/code&gt; returned by an &lt;code&gt;async closure&lt;/code&gt; &lt;strong&gt;can borrow from the closure itself (lending)&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nd"&gt;vec!&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

&lt;span class="c1"&gt;// Rust 1.85+: the Future can borrow data directly&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="p"&gt;|&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;|&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="c1"&gt;// Usage at function boundaries&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="n"&gt;apply&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;AsyncFn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;usize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;i32&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nf"&gt;f&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works because the &lt;code&gt;CallRefFuture&lt;/code&gt; associated type on &lt;code&gt;AsyncFnMut&lt;/code&gt; carries a lifetime parameter, so the returned &lt;code&gt;Future&lt;/code&gt; can hold the borrow of the closure that was taken at call time. &lt;code&gt;FnMut&lt;/code&gt; borrows &lt;code&gt;&amp;amp;mut self&lt;/code&gt; only for the duration of the call; &lt;code&gt;AsyncFnMut&lt;/code&gt; lets the &lt;code&gt;Future&lt;/code&gt; keep that borrow until it completes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;let chains&lt;/code&gt; — Quietly Important
&lt;/h3&gt;

&lt;p&gt;Also stabilized in 2025, &lt;code&gt;let chains&lt;/code&gt; (Rust 1.88, Rust 2024 edition only) look simple but are significant: you can now freely combine &lt;code&gt;if let&lt;/code&gt; patterns and ordinary &lt;code&gt;bool&lt;/code&gt; conditions with &lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt;. Using this requires &lt;code&gt;edition = "2024"&lt;/code&gt; in &lt;code&gt;Cargo.toml&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="c1"&gt;// before: forced to nest or introduce temp variables&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="nf"&gt;.is_active&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="nf"&gt;.role&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// after: flatten conditions and pattern matches on one line&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_user&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="nf"&gt;.is_active&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="nf"&gt;.role&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;println!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"{}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same syntax works in &lt;code&gt;while let&lt;/code&gt; and &lt;code&gt;match&lt;/code&gt; guards, visibly reducing nesting depth throughout a codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Safety Certification: A New Frontier
&lt;/h2&gt;

&lt;p&gt;In December 2025, Ferrous Systems obtained IEC 61508 SIL 2 certification from TÜV SÜD for a subset of Rust's &lt;code&gt;core&lt;/code&gt; library, under &lt;strong&gt;Ferrocene 25.11.0&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is IEC 61508?
&lt;/h3&gt;

&lt;p&gt;IEC 61508 is an international standard for functional safety of electrical, electronic, and programmable electronic safety-related systems. SIL (Safety Integrity Level) 2 corresponds, in high-demand or continuous mode, to a probability of dangerous failure of 10^-7 to 10^-6 per hour. This is a level required in safety-critical domains such as aerospace, medical devices, industrial machinery, and automotive.&lt;/p&gt;

&lt;p&gt;Historically, the de facto standard for safety-critical embedded systems has been C/C++ with MISRA. Rust achieving this level of certification means it has officially stepped into the domain where functional safety is required — not just systems programming for general use.&lt;/p&gt;

&lt;h3&gt;
  
  
  MISRA C 2025 Addendum 6
&lt;/h3&gt;

&lt;p&gt;The same year, MISRA C 2025 Addendum 6 was published: an assessment of how MISRA C rules apply to Rust. The conclusion is that many existing C-specific rules are simply not applicable to Rust — the compiler enforces them by construction via the ownership model. This document also lays the groundwork for a future Rust-specific MISRA rule set.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Adoption Approaching 50%
&lt;/h2&gt;

&lt;p&gt;According to JetBrains' &lt;a href="https://blog.jetbrains.com/rust/2026/02/11/state-of-rust-2025/" rel="noopener noreferrer"&gt;State of Rust Ecosystem 2025&lt;/a&gt;, roughly half of surveyed organizations are now using Rust in production in non-trivial ways — a significant jump from around 38-39% in 2023.&lt;/p&gt;

&lt;p&gt;Microsoft has started rewriting low-level Windows components in Rust, and Google, Amazon, and Meta have each introduced it into OS-level systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  An Unexpected Fit for the LLM Era
&lt;/h2&gt;

&lt;p&gt;In an era dominated by generative AI, Rust is being reassessed in an unexpected way.&lt;/p&gt;

&lt;p&gt;LLM-generated code contains mistakes. Rust's compiler returns those mistakes immediately and concretely as build errors. The type system and borrow checker enumerate the mistakes an AI made — before the code ever runs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error[E0502]: cannot borrow `data` as mutable because it is also borrowed as immutable
  --&amp;gt; src/main.rs:8:5
   |
6  |     let r = &amp;amp;data;
   |             ----- immutable borrow occurs here
7  |     println!("{}", r);
8  |     data.push(4);
   |     ^^^^^^^^^^^^ mutable borrow occurs here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tight feedback loop of "if it compiles, memory safety is guaranteed" (at least for safe Rust) is well-suited to pair programming with an LLM. Compile errors are also easier to reproduce than Python runtime errors: the same source always produces the same diagnostics, regardless of input data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Biggest Concern: Adoption May Plateau
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://blog.rust-lang.org/2026/03/02/2025-State-Of-Rust-Survey-results/" rel="noopener noreferrer"&gt;State of Rust Survey 2025&lt;/a&gt; found that the top concern is not a technical one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"Not enough adoption in the tech industry" came in first at 42.1%&lt;/strong&gt; (narrowly ahead of "the language is getting too complex" at 41.6%).&lt;/p&gt;

&lt;p&gt;The learning curve remains steep and unresolved. Some feel the language itself is growing more complex over time. In early-stage startups where velocity is critical, Rust's upfront cost is real.&lt;/p&gt;

&lt;p&gt;That said, this concern is also the voice of people who want to use Rust more but see adoption lagging. The fact that technical concerns did not dominate the top of the list suggests Rust has reached a meaningful level of maturity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In one sentence: as of 2026, Rust is in the early stages of a shift from "the language people love" to "the language people need."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linux kernel&lt;/strong&gt; — Experimental status removed. Real drivers such as Android Binder and Apple AGX are in use, and Nova is under active development.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;async closures&lt;/strong&gt; — The &lt;code&gt;AsyncFn&lt;/code&gt; trait hierarchy lets Futures borrow from their enclosing closure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;let chains&lt;/strong&gt; — Flatten &lt;code&gt;if let&lt;/code&gt; + bool conditions, reducing nesting without temp variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ferrocene / IEC 61508&lt;/strong&gt; — Rust now has a formal foothold in safety-critical domains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM compatibility&lt;/strong&gt; — The compiler becomes a clear feedback loop for AI-generated code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Top concern is adoption speed&lt;/strong&gt; — Technical concerns have receded; ecosystem breadth is now the focus.&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;Sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.jetbrains.com/rust/2026/02/11/state-of-rust-2025/" rel="noopener noreferrer"&gt;State of Rust Ecosystem 2025 | JetBrains&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.rust-lang.org/2026/03/02/2025-State-Of-Rust-Survey-results/" rel="noopener noreferrer"&gt;2025 State of Rust Survey Results | Rust Blog&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lwn.net/Articles/1049831/" rel="noopener noreferrer"&gt;The (successful) end of the kernel Rust experiment | LWN.net&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rust-for-linux.com/nova-gpu-driver" rel="noopener noreferrer"&gt;Nova GPU Driver | Rust for Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rust-for-linux.com/apple-agx-gpu-driver" rel="noopener noreferrer"&gt;Apple AGX GPU driver | Rust for Linux&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rust-lang.github.io/rfcs/3668-async-closures.html" rel="noopener noreferrer"&gt;RFC 3668: async closures | Rust RFC Book&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoworld.com/article/3812600/async-closure-support-is-stable-for-rust-1-85.html" rel="noopener noreferrer"&gt;Async closure support is stable for Rust 1.85 | InfoWorld&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ferrous-systems.com/blog/ferrocene-libcore-news-release/" rel="noopener noreferrer"&gt;Ferrous Systems achieves IEC 61508 SIL 2 for Rust core | Ferrous Systems&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>rust</category>
      <category>linux</category>
      <category>programming</category>
      <category>systems</category>
    </item>
    <item>
      <title>Open-source SDS tooling for Japanese MHLW compliance: the gap nobody filled</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Sat, 23 May 2026 01:34:49 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/open-source-sds-tooling-for-japanese-mhlw-compliance-the-gap-nobody-filled-6o</link>
      <guid>https://dev.to/kent-tokyo/open-source-sds-tooling-for-japanese-mhlw-compliance-the-gap-nobody-filled-6o</guid>
      <description>&lt;p&gt;In March 2025, Japan's Ministry of Health, Labour and Welfare (MHLW) published a structured JSON schema for Safety Data Sheet data exchange. The schema covers roughly 200 deeply nested fields and is intended to standardize how SDS information moves between chemical management systems.&lt;/p&gt;

&lt;p&gt;Most SDS tooling was not built for this.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes Japan's SDS requirements different
&lt;/h2&gt;

&lt;p&gt;Japan's SDS requirements come from two laws: the Industrial Safety and Health Act (ISHA, 労働安全衛生法) and the Chemical Substances Control Law (化審法). Both mandate SDS for regulated chemicals, with format requirements governed by JIS Z 7253, Japan's implementation of the UN Globally Harmonized System (GHS).&lt;/p&gt;

&lt;p&gt;JIS Z 7253 follows the standard 16-section GHS structure. In principle, any GHS-compliant SDS satisfies the content requirements. What makes Japanese compliance distinct is a digital layer: the MHLW schema specifies how SDS content should be structured as machine-readable data, with field-level granularity that PDF documents cannot capture.&lt;/p&gt;

&lt;h3&gt;
  
  
  How GHS looks different by country
&lt;/h3&gt;

&lt;p&gt;GHS uses a "building block" approach — each country adopts the elements it chooses. The result is that the same GHS-aligned document varies by jurisdiction:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Country/Region&lt;/th&gt;
&lt;th&gt;Standard&lt;/th&gt;
&lt;th&gt;GHS basis&lt;/th&gt;
&lt;th&gt;Notable difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japan&lt;/td&gt;
&lt;td&gt;JIS Z 7253:2019&lt;/td&gt;
&lt;td&gt;GHS Rev. 6&lt;/td&gt;
&lt;td&gt;MHLW digital schema; revised to GHS Rev. 9 in Dec 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;United States&lt;/td&gt;
&lt;td&gt;OSHA HazCom 2012&lt;/td&gt;
&lt;td&gt;GHS Rev. 3&lt;/td&gt;
&lt;td&gt;Updated to GHS Rev. 7 in 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;European Union&lt;/td&gt;
&lt;td&gt;CLP Regulation&lt;/td&gt;
&lt;td&gt;GHS-aligned&lt;/td&gt;
&lt;td&gt;Stricter on environmental hazards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;China&lt;/td&gt;
&lt;td&gt;GB 13690-2009&lt;/td&gt;
&lt;td&gt;GHS Rev. 4 equivalent&lt;/td&gt;
&lt;td&gt;Moving to GB 30000.1-2024 (GHS Rev. 8), mandatory from August 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Taiwan&lt;/td&gt;
&lt;td&gt;CNS 15030&lt;/td&gt;
&lt;td&gt;GHS-aligned&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Japan-specific regulatory fields
&lt;/h3&gt;

&lt;p&gt;The MHLW schema includes fields with no equivalent in EU REACH or US OSHA HazCom formats. These are the main reason international SDS tooling does not cover the schema out of the box:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Law&lt;/th&gt;
&lt;th&gt;Example fields&lt;/th&gt;
&lt;th&gt;What they capture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Chemical Substances Control Law (化審法)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;CaSCL.ClassificationStatus&lt;/code&gt;, &lt;code&gt;CaSCL.RegistrationNumber&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Regulatory classification and registration numbers under this law&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Industrial Safety and Health Act (安衛法)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ISHAct.PublicationOfName&lt;/code&gt;, &lt;code&gt;ISHAct.Notification&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Name disclosure and notification obligations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Poisonous and Deleterious Substances Control Law&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ControlledSubstancesAct.Applicability&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether the substance is classified as poison, deleterious, or specific poison&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PRTR Law&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Chemical release and transfer reporting obligations&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Section 15 (Regulatory Information) is the most complex section in the schema — it contains separate subsections for each of these laws, each with its own field structure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters now: the 2022 law revision
&lt;/h2&gt;

&lt;p&gt;The MHLW published the schema in 2025, but the driver was a 2022 amendment to the Industrial Safety and Health Act. The amendment shifted Japan's chemical substance regulation from a prescriptive model (government designates specific hazardous substances) to an autonomous management model (companies assess and manage risk themselves).&lt;/p&gt;

&lt;p&gt;The practical impact:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Enforcement date&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;April 2023&lt;/td&gt;
&lt;td&gt;Shift to autonomous management model — all substances with confirmed GHS hazard classifications brought progressively into scope&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;April 2024&lt;/td&gt;
&lt;td&gt;SDS must now specify concentration ranges numerically (not just qualitatively)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;April 2025&lt;/td&gt;
&lt;td&gt;Protective equipment mandatory for substances with skin/eye hazards&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;April 2027&lt;/td&gt;
&lt;td&gt;Risk assessment obligations expand to all regulated substances&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With risk assessment coverage expanding significantly, companies need to process SDS data faster and more accurately. Manual PDF entry does not scale. The JSON schema is the infrastructure layer for automating this.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where existing tools stop
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Commercial SDS platforms
&lt;/h3&gt;

&lt;p&gt;The major SDS authoring platforms — Sphera, EcoOnline, Chemwatch, Verisk 3E — have broad international coverage. Japanese is typically a supported output language. What they do not provide, as far as I have found, is export to the MHLW JSON schema. They produce Word or PDF output in the correct section structure, which satisfies the document requirement but not the structured data exchange requirement.&lt;/p&gt;

&lt;p&gt;Japanese-market products like SDS Meister and SmartSDS support MHLW JSON output, but their PDF-to-JSON conversion coverage is limited — they are primarily SDS authoring tools, not bulk conversion tools for incoming supplier documents.&lt;/p&gt;

&lt;h3&gt;
  
  
  Open-source options
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;MHLW JSON&lt;/th&gt;
&lt;th&gt;PDF → JSON&lt;/th&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;sds_parser&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Regex, per-manufacturer rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;tungsten&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Rule-based, English-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;sds-converter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Rust&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;LLM-based extraction&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;sds_parser&lt;/code&gt; and &lt;code&gt;tungsten&lt;/code&gt; solve a different problem: extracting SDS data in English, for specific known manufacturer formats. Neither targets the MHLW schema.&lt;/p&gt;

&lt;h2&gt;
  
  
  The format inconsistency problem
&lt;/h2&gt;

&lt;p&gt;Even within JIS Z 7253-compliant documents, format varies by manufacturer:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source of variation&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Section heading labels&lt;/td&gt;
&lt;td&gt;"2. 危険有害性の要約" (JIS Z 7253) vs "2. Hazard(s) identification" (OSHA HazCom) vs "第2部分 危险性概述" (GB/T 16483) — all mean the same thing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section order&lt;/td&gt;
&lt;td&gt;The 16 sections can appear in any order the manufacturer chooses&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concentration notation&lt;/td&gt;
&lt;td&gt;"≥95%", "1〜5%", "約100%", "企業秘密" (trade secret) all need different handling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Language mixing&lt;/td&gt;
&lt;td&gt;Japanese SDS documents regularly contain English chemical names and CAS numbers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A rule-based parser must enumerate every variant. In practice, manufacturer-specific headings add another layer of variation on top of the standard differences.&lt;/p&gt;
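
&lt;p&gt;As a minimal sketch of what "different handling" means for the concentration notations in the table, the Rust function below normalizes the four example strings. The &lt;code&gt;Concentration&lt;/code&gt; enum and &lt;code&gt;parse_concentration&lt;/code&gt; are hypothetical names for illustration, not part of the MHLW schema or of any existing tool. The mapping choices are assumptions too: an open-ended "≥" is capped at 100, and "約" is collapsed to a single value.&lt;/p&gt;

```rust
use std::str::FromStr;

/// Normalized concentration value (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum Concentration {
    /// Lower and upper bound in percent, both inclusive.
    Range(f64, f64),
    /// The supplier withheld the value ("企業秘密").
    TradeSecret,
    /// No rule matched; the raw text is kept for manual review.
    Unparsed(String),
}

/// Rule-based normalization of one concentration cell.
/// Takes ownership so unmatched input can be handed back without copying.
fn parse_concentration(raw: String) -> Concentration {
    // Drop surrounding whitespace and a trailing half-width or full-width percent sign.
    let text = raw
        .trim()
        .trim_end_matches(|c: char| c == '%' || c == '％')
        .trim();

    if text == "企業秘密" {
        return Concentration::TradeSecret;
    }
    // "≥95": assumed to mean "from 95 up to 100".
    if let Some(rest) = text.strip_prefix('≥') {
        return match f64::from_str(rest.trim()) {
            Ok(low) => Concentration::Range(low, 100.0),
            Err(_) => Concentration::Unparsed(raw),
        };
    }
    // "約100" (approximately): assumed to collapse to a single value.
    if let Some(rest) = text.strip_prefix('約') {
        return match f64::from_str(rest.trim()) {
            Ok(value) => Concentration::Range(value, value),
            Err(_) => Concentration::Unparsed(raw),
        };
    }
    // "1〜5": a wave dash or full-width tilde between two numbers.
    if let Some((low, high)) = text.split_once(|c: char| c == '〜' || c == '～') {
        return match (f64::from_str(low.trim()), f64::from_str(high.trim())) {
            (Ok(low), Ok(high)) => Concentration::Range(low, high),
            _ => Concentration::Unparsed(raw),
        };
    }
    // A bare number is treated as an exact value.
    match f64::from_str(text) {
        Ok(value) => Concentration::Range(value, value),
        Err(_) => Concentration::Unparsed(raw),
    }
}
```

&lt;p&gt;Under these assumptions, "≥95%" becomes the range 95 to 100 and "企業秘密" becomes &lt;code&gt;TradeSecret&lt;/code&gt;. Any notation the rules do not anticipate lands in &lt;code&gt;Unparsed&lt;/code&gt;, which is the enumeration burden a rule-based parser carries.&lt;/p&gt;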

&lt;h2&gt;
  
  
  The schema itself
&lt;/h2&gt;

&lt;p&gt;Two properties of the MHLW schema are worth knowing before implementing against it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Section 3 (composition) is the hardest part
&lt;/h3&gt;

&lt;p&gt;Section 3 stores component information as a repeating array. Each component object has nested fields for chemical identity, concentration range, and hazard classification. The same data appears differently depending on whether the source document covers a pure substance, a mixture, or a trade secret formulation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Composition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"CompositionAndConcentration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ChemicalIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"CASNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"64-17-5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ISHActNotificationNumber"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2-396"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"ConcentrationRange"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ConcentrationRangeFrom"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;95.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ConcentrationRangeTo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;100.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"ConcentrationRangeUnit"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"%"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"TradeSecretFlag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Typos locked into v1.0
&lt;/h3&gt;

&lt;p&gt;The schema contains field name errors that are now part of the specification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;HumanExposureAndEmergencyMeasuress  ← trailing double-s
TestGuidline                        ← missing 'e' (not Guideline)
Desclaimer                          ← 'e' in place of 'i' (not Disclaimer)
gazetteNo                           ← lowercase first character
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Correcting these would break all existing implementations, so they cannot be fixed in v1.0. An implementation that normalizes these to standard English spellings will fail schema validation.&lt;/p&gt;
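
&lt;p&gt;One way to keep the misspellings out of application code is to confine them to a single lookup. The following is a minimal sketch, not code from sds-converter: the &lt;code&gt;QuirkyField&lt;/code&gt; enum and &lt;code&gt;schema_key&lt;/code&gt; function are hypothetical names, and only the four key strings come from the schema.&lt;/p&gt;

```rust
/// Hypothetical identifiers for four schema fields. The variant names use
/// corrected spellings and are NOT part of the MHLW spec.
#[derive(Clone, Copy)]
enum QuirkyField {
    HumanExposureAndEmergencyMeasures,
    TestGuideline,
    Disclaimer,
    GazetteNo,
}

/// Returns the key exactly as the v1.0 schema spells it, typos included.
fn schema_key(field: QuirkyField) -> String {
    let key = match field {
        QuirkyField::HumanExposureAndEmergencyMeasures => "HumanExposureAndEmergencyMeasuress",
        QuirkyField::TestGuideline => "TestGuidline",
        QuirkyField::Disclaimer => "Desclaimer",
        QuirkyField::GazetteNo => "gazetteNo",
    };
    key.to_string()
}
```

&lt;p&gt;Code elsewhere can then use the readable variant names, while anything written to JSON goes through the lookup and keeps the spelling the validator expects.&lt;/p&gt;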

&lt;h2&gt;
  
  
  sds-converter
&lt;/h2&gt;

&lt;p&gt;I built &lt;a href="https://github.com/kent-tokyo/sds-converter" rel="noopener noreferrer"&gt;sds-converter&lt;/a&gt; to address the MHLW schema gap. It handles both directions: PDF/DOCX/XLSX to MHLW JSON, and MHLW JSON to a JIS Z 7253-compliant Word document.&lt;/p&gt;

&lt;p&gt;The core approach: rather than enumerating format variants with rules, the tool passes raw section text and the corresponding MHLW schema fields to an LLM and asks it to map values. The LLM handles heading label variation naturally. The output is validated against the schema before writing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;sds-converter

&lt;span class="c"&gt;# PDF → MHLW JSON&lt;/span&gt;
sds-converter to-json &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.json

&lt;span class="c"&gt;# MHLW JSON → JIS Z 7253 Word document&lt;/span&gt;
sds-converter to-docx &lt;span class="nt"&gt;--input&lt;/span&gt; output.json &lt;span class="nt"&gt;--output&lt;/span&gt; result.docx &lt;span class="nt"&gt;--lang&lt;/span&gt; ja
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM backend is pluggable — Claude, GPT, Gemini, Mistral, Groq, or local models via Ollama. A &lt;code&gt;--quality&lt;/code&gt; flag adjusts cost versus accuracy for batch workloads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Known limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Issue&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scanned PDFs without a text layer&lt;/td&gt;
&lt;td&gt;Not supported — requires upstream OCR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Section 3 tables with merged cells&lt;/td&gt;
&lt;td&gt;Extraction sometimes fails on complex DOCX layouts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision fields mixed with "not measured" entries&lt;/td&gt;
&lt;td&gt;Occasional type errors in Section 9 output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are open problems, not design decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The open gap
&lt;/h2&gt;

&lt;p&gt;The MHLW schema represents a real need for anyone handling chemical compliance in Japan at volume. Commercial tools cover the authoring side, but bulk conversion of incoming supplier PDFs into structured data is largely unserved: sds-converter, which I developed, is the only open-source implementation targeting this schema that I am aware of.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/kent-tokyo/sds-converter" rel="noopener noreferrer"&gt;repository&lt;/a&gt; is open. Contributions on the extraction side — particularly Section 3 table handling — are welcome. If you work in cheminformatics or chemical compliance and have approached the MHLW compliance problem differently, I would be interested to hear it.&lt;/p&gt;

</description>
      <category>chemistry</category>
      <category>rust</category>
      <category>opensource</category>
      <category>cheminformatics</category>
    </item>
    <item>
      <title>sds-converter: Converting Safety Data Sheets to MHLW Standard JSON with Rust and LLMs</title>
      <dc:creator>kent-tokyo</dc:creator>
      <pubDate>Fri, 22 May 2026 23:09:11 +0000</pubDate>
      <link>https://dev.to/kent-tokyo/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and-llms-ihg</link>
      <guid>https://dev.to/kent-tokyo/sds-converter-converting-safety-data-sheets-to-mhlw-standard-json-with-rust-and-llms-ihg</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Safety Data Sheets (SDS) are mandatory documents for every chemical product — solvents, adhesives, industrial gases, cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.&lt;/p&gt;

&lt;p&gt;The Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.&lt;/p&gt;

&lt;p&gt;The problem is that real SDS documents don't arrive structured to this schema.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why SDS documents are hard to parse
&lt;/h2&gt;

&lt;p&gt;Even two documents both compliant with JIS Z 7253 will differ in ways that break rule-based parsers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Section order&lt;/strong&gt; — manufacturers arrange the 16 sections freely within the standard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Field labeling&lt;/strong&gt; — the same data appears under different headings across JIS Z 7253, GHS/OSHA HazCom, GB/T 16483, CNS 15030, and company-specific layouts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Value representation&lt;/strong&gt; — &lt;code&gt;"≥99.5%"&lt;/code&gt;, &lt;code&gt;"99.5% or higher"&lt;/code&gt;, and &lt;code&gt;"approximately 100%"&lt;/code&gt; can all describe the same concentration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language mixing&lt;/strong&gt; — Japanese SDS regularly embed English chemical names and CAS numbers mid-sentence&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implicit information&lt;/strong&gt; — section 9 (physical/chemical properties) often has half its fields missing because manufacturers only fill in what's relevant to their product&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The MHLW schema compounds this: some official field names contain typos that must be reproduced exactly. &lt;code&gt;HumanExposureAndEmergencyMeasuress&lt;/code&gt; ends in double-&lt;code&gt;s&lt;/code&gt;. &lt;code&gt;TestGuidline&lt;/code&gt; is missing an &lt;code&gt;e&lt;/code&gt;. &lt;code&gt;Desclaimer&lt;/code&gt; has an &lt;code&gt;e&lt;/code&gt; where &lt;code&gt;Disclaimer&lt;/code&gt; has an &lt;code&gt;i&lt;/code&gt;. These spellings are in the official spec, and validation fails if you "fix" them.&lt;/p&gt;

&lt;p&gt;To handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, you'd need separate parsers for each format. Writing and maintaining those is impractical. I built &lt;a href="https://github.com/kent-tokyo/sds-converter" rel="noopener noreferrer"&gt;sds-converter&lt;/a&gt; to handle this with an LLM instead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 16 sections
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Schema key&lt;/th&gt;
&lt;th&gt;JIS Z 7253 section&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Identification&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Chemical identity and company information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HazardIdentification&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hazard identification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Composition&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Composition / information on ingredients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FirstAidMeasures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;First-aid measures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FireFightingMeasures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fire-fighting measures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;&lt;code&gt;AccidentalReleaseMeasures&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Accidental release measures&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HandlingAndStorage&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Handling and storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ExposureControlPersonalProtection&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Exposure controls / personal protection&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;&lt;code&gt;PhysicalChemicalProperties&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Physical and chemical properties&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;&lt;code&gt;StabilityReactivity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Stability and reactivity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ToxicologicalInformation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Toxicological information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;&lt;code&gt;EcologicalInformation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ecological information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DisposalConsiderations&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disposal considerations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;&lt;code&gt;TransportInformation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Transport information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;&lt;code&gt;RegulatoryInformation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Regulatory information&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OtherInformation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Other information&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Installation and quick start
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cargo &lt;span class="nb"&gt;install &lt;/span&gt;sds-converter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# PDF → MHLW standard JSON&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ANTHROPIC_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;sk-ant-...
sds-converter to-json &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.json

&lt;span class="c"&gt;# MHLW JSON → JIS Z 7253-compliant Word document&lt;/span&gt;
sds-converter to-docx &lt;span class="nt"&gt;--input&lt;/span&gt; output.json &lt;span class="nt"&gt;--output&lt;/span&gt; result.docx &lt;span class="nt"&gt;--lang&lt;/span&gt; ja

&lt;span class="c"&gt;# Schema validation&lt;/span&gt;
sds-converter validate &lt;span class="nt"&gt;--input&lt;/span&gt; output.json

&lt;span class="c"&gt;# Extract raw text (no LLM call — useful for debugging)&lt;/span&gt;
sds-converter extract-text &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Supported input: PDF, DOCX, XLSX, TXT.&lt;/p&gt;




&lt;h2&gt;
  
  
  How the conversion works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Step 1: Text extraction
&lt;/h3&gt;

&lt;p&gt;Text is pulled from the input file (PDF, DOCX, XLSX, or TXT). Use &lt;code&gt;extract-text&lt;/code&gt; to inspect exactly what gets sent to the LLM, which is useful when extraction quality is lower than expected.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Encrypted PDFs and scan-only (image) PDFs are not supported — text extraction requires selectable text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Parallel LLM extraction
&lt;/h3&gt;

&lt;p&gt;The 16 sections are split into two groups and extracted with two parallel LLM calls, which roughly halves per-file latency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GROUP_A&lt;/strong&gt; (sections 1–9): identification, hazard, composition, first aid, fire fighting, accidental release, handling, exposure, physical properties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GROUP_B&lt;/strong&gt; (sections 10–16): stability, toxicology, ecological, disposal, transport, regulatory, other&lt;/li&gt;
&lt;/ul&gt;
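
&lt;p&gt;The two-group split can be sketched with plain standard-library threads. This is an illustration under stated assumptions, not the tool's actual code: &lt;code&gt;extract_group&lt;/code&gt; and &lt;code&gt;extract_all&lt;/code&gt; are hypothetical names, the LLM call is replaced by a stand-in that only formats a string, and the real implementation may well use async tasks instead of threads.&lt;/p&gt;

```rust
use std::thread;

/// Stand-in for one LLM call (hypothetical). A real backend would send the
/// document text plus the schema fields for sections `first` through `last`.
fn extract_group(label: String, first: u8, last: u8) -> String {
    format!("{label}: sections {first}-{last}")
}

/// Runs both group extractions concurrently, then merges the two results.
fn extract_all() -> String {
    let group_a = thread::spawn(|| extract_group("GROUP_A".to_string(), 1, 9));
    let group_b = thread::spawn(|| extract_group("GROUP_B".to_string(), 10, 16));

    // Each join blocks until that worker finishes, so the total wait is
    // about the slower of the two calls rather than their sum.
    let a = group_a.join().expect("GROUP_A worker panicked");
    let b = group_b.join().expect("GROUP_B worker panicked");
    format!("{a}; {b}")
}
```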

&lt;p&gt;Results from both calls are merged. Sections skipped in the first pass are automatically retried. HTTP 429 (rate limit) and 529 (overloaded) responses trigger retries with exponential backoff (2s → 4s → 8s, up to 3 attempts).&lt;/p&gt;
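
&lt;p&gt;The retry schedule above can be written down in a few lines. This is a minimal sketch, assuming that "up to 3 attempts" means three retries after the first failed call; &lt;code&gt;backoff_delay&lt;/code&gt;, &lt;code&gt;should_retry&lt;/code&gt;, and &lt;code&gt;MAX_RETRIES&lt;/code&gt; are hypothetical names rather than identifiers from sds-converter.&lt;/p&gt;

```rust
use std::time::Duration;

/// Retries allowed after the first failed call (assumed reading of
/// "up to 3 attempts").
const MAX_RETRIES: u32 = 3;

/// Delay before the n-th retry (1-based): 2s, 4s, 8s.
fn backoff_delay(retry: u32) -> Duration {
    Duration::from_secs(2u64.pow(retry))
}

/// True when the status is one of the retryable codes and the retry
/// budget has not been used up yet.
fn should_retry(status: u16, retries_done: u32) -> bool {
    if retries_done >= MAX_RETRIES {
        return false;
    }
    matches!(status, 429 | 529)
}
```

&lt;p&gt;A caller would check &lt;code&gt;should_retry&lt;/code&gt; after each failed response, sleep for &lt;code&gt;backoff_delay&lt;/code&gt;, and give up once the budget is spent.&lt;/p&gt;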

&lt;h3&gt;
  
  
  Step 3: JSON output
&lt;/h3&gt;

&lt;p&gt;The merged result is written as MHLW SDS data exchange format v1.0 JSON.&lt;/p&gt;




&lt;h2&gt;
  
  
  LLM backend and quality settings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choosing a provider
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# OpenAI GPT (gpt-4o-mini by default)&lt;/span&gt;
sds-converter to-json &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; openai &lt;span class="nt"&gt;--api-key&lt;/span&gt; &lt;span class="nv"&gt;$OPENAI_API_KEY&lt;/span&gt;

&lt;span class="c"&gt;# Google Gemini (gemini-2.0-flash by default)&lt;/span&gt;
sds-converter to-json &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; gemini &lt;span class="nt"&gt;--api-key&lt;/span&gt; &lt;span class="nv"&gt;$GEMINI_API_KEY&lt;/span&gt;

&lt;span class="c"&gt;# Local LLM via Ollama (any OpenAI-compatible endpoint)&lt;/span&gt;
sds-converter to-json &lt;span class="nt"&gt;--input&lt;/span&gt; input.pdf &lt;span class="nt"&gt;--output&lt;/span&gt; output.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--provider&lt;/span&gt; &lt;span class="nb"&gt;local&lt;/span&gt; &lt;span class="nt"&gt;--base-url&lt;/span&gt; http://localhost:11434/v1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--model&lt;/span&gt; llama3.2 &lt;span class="nt"&gt;--api-key&lt;/span&gt; dummy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;--provider&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Default model&lt;/th&gt;
&lt;th&gt;Environment variable&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;anthropic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;claude-haiku-4-5-20251001&lt;/code&gt; (low/medium) · &lt;code&gt;claude-sonnet-4-6&lt;/code&gt; (high)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;openai&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gpt-4o-mini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OPENAI_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gemini&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gemini-2.0-flash&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GEMINI_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mistral&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;mistral-small-latest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MISTRAL_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;groq&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama-3.3-70b-versatile&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;GROQ_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;cohere&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;command-r-plus&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;COHERE_API_KEY&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;local&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;LOCAL_LLM_API_KEY&lt;/code&gt; (optional)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Quality preset
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;--quality&lt;/code&gt; controls both the model and how much text is sent to the LLM per call:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;code&gt;--quality&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Model (Anthropic)&lt;/th&gt;
&lt;th&gt;Max text fed to LLM&lt;/th&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;low&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;claude-haiku-4-5&lt;/td&gt;
&lt;td&gt;15,000 chars&lt;/td&gt;
&lt;td&gt;Speed/cost priority&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;medium&lt;/code&gt; (default)&lt;/td&gt;
&lt;td&gt;claude-haiku-4-5&lt;/td&gt;
&lt;td&gt;30,000 chars&lt;/td&gt;
&lt;td&gt;Balanced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;high&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;claude-sonnet-4-6&lt;/td&gt;
&lt;td&gt;60,000 chars&lt;/td&gt;
&lt;td&gt;Accuracy priority&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At &lt;code&gt;high&lt;/code&gt;, the 60,000-character limit leaves room for the full document text, including the later sections (transport information, regulatory information) that a smaller limit can cut off. Use &lt;code&gt;--quality high&lt;/code&gt; when complete 16-section coverage matters.&lt;/p&gt;
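
&lt;p&gt;To make the effect of the character limit concrete, here is a small sketch of preset-based truncation. The limits come from the table above; everything else is an assumption for illustration: the &lt;code&gt;Quality&lt;/code&gt; enum, &lt;code&gt;max_chars&lt;/code&gt;, and &lt;code&gt;truncate_for_llm&lt;/code&gt; are hypothetical names, and I am assuming the limit counts characters rather than bytes and that the tail of the document is what gets dropped.&lt;/p&gt;

```rust
/// Quality presets as listed in the table above (hypothetical type).
#[derive(Clone, Copy)]
enum Quality {
    Low,
    Medium,
    High,
}

/// Character budget per preset, taken from the table.
fn max_chars(quality: Quality) -> usize {
    match quality {
        Quality::Low => 15_000,
        Quality::Medium => 30_000,
        Quality::High => 60_000,
    }
}

/// Keeps at most the preset's number of characters from the start of the
/// text. Counting `chars()` instead of bytes means multi-byte Japanese
/// text is never split in the middle of a character.
fn truncate_for_llm(text: String, quality: Quality) -> String {
    text.chars().take(max_chars(quality)).collect()
}
```

&lt;p&gt;Under these assumptions, a 20,000-character document loses its last 5,000 characters at &lt;code&gt;low&lt;/code&gt; and passes through unchanged at &lt;code&gt;medium&lt;/code&gt; or &lt;code&gt;high&lt;/code&gt;.&lt;/p&gt;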

&lt;h3&gt;
  
  
  Batch mode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sds-converter to-json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--input-dir&lt;/span&gt; ./pdfs/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-dir&lt;/span&gt; ./json/ &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--lang&lt;/span&gt; ja &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--concurrency&lt;/span&gt; 4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Validation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;validate&lt;/code&gt; checks structural completeness of the extracted JSON and returns warnings without hard-failing — partial results are still usable.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;sds-converter validate &lt;span class="nt"&gt;--input&lt;/span&gt; output.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examples of conditions that produce a warning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Section 1: no product name (TradeNameJP or TradeNameEN)&lt;/li&gt;
&lt;li&gt;Section 1: SupplierInformation missing&lt;/li&gt;
&lt;li&gt;Section 2: neither Classification nor HazardLabelling extracted&lt;/li&gt;
&lt;li&gt;Section 3: CompositionAndConcentration list is empty&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When using the library, &lt;code&gt;convert_to_json&lt;/code&gt; returns a &lt;code&gt;(SdsRoot, Vec&amp;lt;String&amp;gt;)&lt;/code&gt; tuple — the warnings are surfaced inline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Output JSON structure
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Datasheet"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"IssueDate"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2024-03-31"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SDS-SchemaVersionNo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1.0"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Identification"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"TradeProductIdentity"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"TradeNameJP"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sample Product"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"SupplierInformation"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"CompanyName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sample Corp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Phone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"03-0000-0000"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full schema covers all 16 JIS Z 7253 sections with ~200 fields. The official spec and developer manual are on the &lt;a href="https://www.mhlw.go.jp/stf/newpage_56484.html" rel="noopener noreferrer"&gt;MHLW website&lt;/a&gt; (Japanese).&lt;/p&gt;




&lt;h2&gt;
  
  
  Using as a library
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[dependencies]&lt;/span&gt;
&lt;span class="py"&gt;sds-converter-core&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"0.1"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  PDF → JSON
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;sds_converter_core&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;
    &lt;span class="nn"&gt;converter&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;AnthropicBackend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;LlmConfig&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;convert_to_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConvertConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="nd"&gt;#[tokio::main]&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;anyhow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;AnthropicBackend&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;env&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;var&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="nn"&gt;LlmConfig&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConvertConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;source_language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;Some&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nn"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Japanese&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;output_language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Japanese&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;warnings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;convert_to_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"input.pdf"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;warnings&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nd"&gt;eprintln!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"WARN: {w}"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;to_string_pretty&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  JSON → Word document
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;sds_converter_core&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;convert_from_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ConvertConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SdsRoot&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;anyhow&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;sds&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;SdsRoot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;serde_json&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;from_str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;read_to_string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"output.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ConvertConfig&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;output_language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nn"&gt;Language&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Japanese&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="o"&gt;..&lt;/span&gt;&lt;span class="nn"&gt;Default&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;default&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nf"&gt;convert_from_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;new&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"result.docx"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(())&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Custom LLM backend
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;use&lt;/span&gt; &lt;span class="nn"&gt;sds_converter_core&lt;/span&gt;&lt;span class="p"&gt;::{&lt;/span&gt;&lt;span class="n"&gt;LlmBackend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SdsError&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="n"&gt;MyBackend&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;impl&lt;/span&gt; &lt;span class="n"&gt;LlmBackend&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;MyBackend&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;Result&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nb"&gt;String&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SdsError&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Call your LLM API, return the raw JSON string response&lt;/span&gt;
        &lt;span class="nd"&gt;todo!&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Language support
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Language&lt;/th&gt;
&lt;th&gt;&lt;code&gt;--lang&lt;/code&gt;&lt;/th&gt;
&lt;th&gt;Source standard&lt;/th&gt;
&lt;th&gt;Output DOCX headings&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Japanese&lt;/td&gt;
&lt;td&gt;&lt;code&gt;ja&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;JIS Z 7253&lt;/td&gt;
&lt;td&gt;JIS Z 7253&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;English&lt;/td&gt;
&lt;td&gt;&lt;code&gt;en&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GHS/OSHA HazCom&lt;/td&gt;
&lt;td&gt;GHS Rev.10 / ISO 11014&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Simplified Chinese&lt;/td&gt;
&lt;td&gt;&lt;code&gt;zh-cn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;GB/T 16483-2012&lt;/td&gt;
&lt;td&gt;GB/T 16483-2012&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Traditional Chinese&lt;/td&gt;
&lt;td&gt;&lt;code&gt;zh-tw&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;CNS 15030&lt;/td&gt;
&lt;td&gt;CNS 15030&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Comparison with alternatives
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Open-source
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;sds-converter&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://github.com/astepe/sds_parser" rel="noopener noreferrer"&gt;sds_parser&lt;/a&gt;&lt;/th&gt;
&lt;th&gt;&lt;a href="https://github.com/CrucibleSDS/tungsten" rel="noopener noreferrer"&gt;tungsten&lt;/a&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Language&lt;/td&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;td&gt;Python&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI/LLM&lt;/td&gt;
&lt;td&gt;Yes (pluggable)&lt;/td&gt;
&lt;td&gt;No (regex)&lt;/td&gt;
&lt;td&gt;No (rule-based)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MHLW JSON&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bidirectional&lt;/td&gt;
&lt;td&gt;Yes (↔ DOCX)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multilingual&lt;/td&gt;
&lt;td&gt;ja / en / zh-CN / zh-TW&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;English only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Commercial (Japan)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;sds-converter&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;SDS Meister&lt;/th&gt;
&lt;th&gt;SmartSDS&lt;/th&gt;
&lt;th&gt;Dr.EHS Chemical&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AI&lt;/td&gt;
&lt;td&gt;Yes (your API key)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (translation)&lt;/td&gt;
&lt;td&gt;AI-OCR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MHLW JSON&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PDF → JSON&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (authoring only)&lt;/td&gt;
&lt;td&gt;Partial (JP only)&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open-source&lt;/td&gt;
&lt;td&gt;MIT/Apache-2.0&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Of the tools compared above, sds-converter is the only open-source one that supports the MHLW schema, runs entirely locally, and handles the full round trip: to JSON and back to DOCX.&lt;/p&gt;




&lt;h2&gt;
  
  
  Crate structure
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sds-converter-core&lt;/code&gt;&lt;/strong&gt; — library. LLM extraction, DOCX generation, MHLW schema types.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;sds-converter&lt;/code&gt;&lt;/strong&gt; — CLI binary. &lt;code&gt;to-json&lt;/code&gt;, &lt;code&gt;to-docx&lt;/code&gt;, &lt;code&gt;validate&lt;/code&gt;, &lt;code&gt;extract-text&lt;/code&gt; subcommands.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feedback is welcome, especially on extraction of the Section 3 component table and on accuracy for non-Japanese documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kent-tokyo/sds-converter" rel="noopener noreferrer"&gt;https://github.com/kent-tokyo/sds-converter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>chemistry</category>
      <category>rust</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
