kent-tokyo

Posted on May 22

sds-converter: Converting Safety Data Sheets to MHLW Standard JSON with Rust and LLMs

#ai #chemistry #rust #opensource

Background

Safety Data Sheets (SDS) are mandatory documents for hazardous chemical products such as solvents, adhesives, industrial gases, and cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.

The Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.

The problem is that real SDS documents don't arrive structured to this schema.

Why SDS documents are hard to parse

Even two documents that both comply with JIS Z 7253 can differ in ways that break rule-based parsers:

Section order: manufacturers arrange the 16 sections freely within the standard
Field labeling: the same data appears under different headings across JIS Z 7253, GHS/OSHA HazCom, GB/T 16483, CNS 15030, and company-specific layouts
Value representation: "≥99.5%", "99.5% or higher", and "approximately 100%" can all describe the same purity
Language mixing: Japanese SDSs regularly embed English chemical names and CAS numbers mid-sentence
Implicit information: section 9 (physical/chemical properties) often has half its fields missing because manufacturers only fill in what's relevant to their product
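
To make the value-representation problem concrete, here is a minimal sketch of what a rule-based approach looks like. The function name `parse_lower_bound` and its two rules are illustrative assumptions, not code from sds-converter. It handles two of the three spellings listed above and gives up on the third, and every new phrasing would need another hand-written rule.

```rust
/// Hypothetical rule-based parser: extracts a lower-bound percentage.
/// Only the two phrasings below are recognized.
fn parse_lower_bound(raw: &str) -> Option<f64> {
    let s = raw.trim();
    // Rule 1: "≥99.5%"
    if let Some(rest) = s.strip_prefix('≥') {
        return rest.trim().trim_end_matches('%').trim().parse().ok();
    }
    // Rule 2: "99.5% or higher"
    if let Some(rest) = s.strip_suffix("or higher") {
        return rest.trim().trim_end_matches('%').trim().parse().ok();
    }
    // Anything else (for example "approximately 100%") falls through.
    None
}
```

Multiply this by 16 sections, four standards, and company-specific layouts, and the rule set becomes hard to maintain.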

The MHLW schema compounds this: it contains typos that must be reproduced exactly. HumanExposureAndEmergencyMeasuress ends in a double s. TestGuidline is missing an e. Desclaimer has an e where the i belongs. These are in the official spec, and validation fails if you "fix" them.
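
One way to keep those spellings out of the rest of the code is a single lookup that maps the corrected name to the key the schema actually uses. The sketch below is an assumption about how this could be handled, and `schema_key` is a hypothetical helper. In a serde-based codebase the same effect is usually achieved with `#[serde(rename = "...")]` on a correctly spelled field; whether sds-converter-core does it that way is not stated here.

```rust
/// Hypothetical helper: returns the key exactly as the MHLW schema spells it.
/// Names without a known quirk are passed through unchanged.
fn schema_key(corrected: &str) -> &str {
    match corrected {
        "HumanExposureAndEmergencyMeasures" => "HumanExposureAndEmergencyMeasuress",
        "TestGuideline" => "TestGuidline",
        "Disclaimer" => "Desclaimer",
        other => other,
    }
}
```

Keeping the misspellings in one place means a spell-checker or a well-meaning refactor cannot quietly break validation elsewhere.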

To handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, you'd need separate parsers for each format. Writing and maintaining those is impractical. I built sds-converter to handle this with an LLM instead.

The 16 sections

#	Schema key	JIS Z 7253 section
1	`Identification`	Chemical identity and company information
2	`HazardIdentification`	Hazard identification
3	`Composition`	Composition / information on ingredients
4	`FirstAidMeasures`	First-aid measures
5	`FireFightingMeasures`	Fire-fighting measures
6	`AccidentalReleaseMeasures`	Accidental release measures
7	`HandlingAndStorage`	Handling and storage
8	`ExposureControlPersonalProtection`	Exposure controls / personal protection
9	`PhysicalChemicalProperties`	Physical and chemical properties
10	`StabilityReactivity`	Stability and reactivity
11	`ToxicologicalInformation`	Toxicological information
12	`EcologicalInformation`	Ecological information
13	`DisposalConsiderations`	Disposal considerations
14	`TransportInformation`	Transport information
15	`RegulatoryInformation`	Regulatory information
16	`OtherInformation`	Other information

Installation and quick start

cargo install sds-converter

# PDF → MHLW standard JSON
export ANTHROPIC_API_KEY=sk-ant-...
sds-converter to-json --input input.pdf --output output.json

# MHLW JSON → JIS Z 7253-compliant Word document
sds-converter to-docx --input output.json --output result.docx --lang ja

# Schema validation
sds-converter validate --input output.json

# Extract raw text (no LLM call — useful for debugging)
sds-converter extract-text --input input.pdf

Supported input: PDF, DOCX, XLSX, TXT.

How the conversion works

Step 1: Text extraction

Text is pulled from the input file (PDF, DOCX, XLSX, or TXT). Use extract-text to inspect exactly what gets sent to the LLM, which is useful when extraction quality is lower than expected.

Note: Encrypted PDFs and scan-only (image) PDFs are not supported — text extraction requires selectable text.

Step 2: Parallel LLM extraction

The 16 sections are split into two groups and extracted with two parallel LLM calls, which roughly halves per-file latency compared with running the calls one after the other:

GROUP_A (sections 1–9): identification, hazard, composition, first aid, fire fighting, accidental release, handling, exposure, physical properties
GROUP_B (sections 10–16): stability, toxicology, ecological, disposal, transport, regulatory, other
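
The split-and-merge shape can be sketched with the standard library alone. This is an illustration under assumptions: the real crate is async (the library example uses tokio), whereas this sketch uses scoped threads as a stand-in, and `extract_all`, `missing_sections`, and the `call_llm` closure are hypothetical names. The closure stands for one LLM call that returns whichever sections of the requested range it managed to extract.

```rust
use std::collections::BTreeMap;
use std::ops::RangeInclusive;
use std::thread;

/// Runs one call per group in parallel and merges the results.
/// Keys are section numbers, values are the extracted JSON text.
fn extract_all<F>(call_llm: F) -> BTreeMap<u8, String>
where
    F: Fn(RangeInclusive<u8>) -> BTreeMap<u8, String> + Sync,
{
    let (mut merged, group_b) = thread::scope(|s| {
        let a = s.spawn(|| call_llm(1..=9)); // GROUP_A
        let b = s.spawn(|| call_llm(10..=16)); // GROUP_B
        (
            a.join().expect("group A call panicked"),
            b.join().expect("group B call panicked"),
        )
    });
    merged.extend(group_b);
    merged
}

/// Sections absent from the merged result, i.e. candidates for a retry pass.
fn missing_sections(merged: &BTreeMap<u8, String>) -> Vec<u8> {
    (1..=16).filter(|n| !merged.contains_key(n)).collect()
}
```

Because the two groups cover disjoint section numbers, merging is a plain map union with no conflict handling.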

Results from both calls are merged. Sections skipped in the first pass are automatically retried. Rate-limit and overload responses (HTTP 429/529) trigger retries with exponential backoff (2s → 4s → 8s, up to 3 attempts).
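
The backoff schedule can be sketched as follows. This is not the crate's implementation: `backoff_delay` and `send_with_retry` are hypothetical names, and the sketch reads "up to 3 attempts" as three retries after the initial request. The sleep function is passed in so the schedule can be checked without actually waiting.

```rust
use std::time::Duration;

const MAX_RETRIES: u32 = 3;

/// Delay before retry number `retry` (1-based): 2s, 4s, 8s.
/// Returns None once the retry budget is exhausted.
fn backoff_delay(retry: u32) -> Option<Duration> {
    (1..=MAX_RETRIES)
        .contains(&retry)
        .then(|| Duration::from_secs(2u64.pow(retry)))
}

/// Status codes that are retried in this sketch.
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 529)
}

/// `send` performs one request and returns its HTTP status.
/// Returns the last status seen, retryable or not.
fn send_with_retry(mut send: impl FnMut() -> u16, mut sleep: impl FnMut(Duration)) -> u16 {
    let mut retry = 0;
    loop {
        let status = send();
        if !is_retryable(status) {
            return status;
        }
        retry += 1;
        match backoff_delay(retry) {
            Some(delay) => sleep(delay),
            None => return status, // budget spent, surface the error
        }
    }
}
```

In real use the `sleep` argument would be `std::thread::sleep` or an async timer.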

Step 3: JSON output

The merged result is written as MHLW SDS data exchange format v1.0 JSON.

LLM backend and quality settings

Choosing a provider

# OpenAI GPT (gpt-4o-mini by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider openai --api-key $OPENAI_API_KEY

# Google Gemini (gemini-2.0-flash by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider gemini --api-key $GEMINI_API_KEY

# Local LLM via Ollama (any OpenAI-compatible endpoint)
sds-converter to-json --input input.pdf --output output.json \
  --provider local --base-url http://localhost:11434/v1 \
  --model llama3.2 --api-key dummy

`--provider`	Default model	Environment variable
`anthropic`	`claude-haiku-4-5-20251001` (low/medium) · `claude-sonnet-4-6` (high)	`ANTHROPIC_API_KEY`
`openai`	`gpt-4o-mini`	`OPENAI_API_KEY`
`gemini`	`gemini-2.0-flash`	`GEMINI_API_KEY`
`mistral`	`mistral-small-latest`	`MISTRAL_API_KEY`
`groq`	`llama-3.3-70b-versatile`	`GROQ_API_KEY`
`cohere`	`command-r-plus`	`COHERE_API_KEY`
`local`	`llama3`	`LOCAL_LLM_API_KEY` (optional)

Quality preset

--quality controls both the model and how much text is sent to the LLM per call:

`--quality`	Model (Anthropic)	Max text fed to LLM	Use case
`low`	claude-haiku-4-5	15,000 chars	Speed/cost priority
`medium` (default)	claude-haiku-4-5	30,000 chars	Balanced
`high`	claude-sonnet-4-6	60,000 chars	Accuracy priority

At high, the limit is large enough that the full document text, including the later sections (transport and regulatory information), is normally included. Use --quality high when complete 16-section coverage matters.
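
The character limits in the table can be expressed as a small preset type. The sketch below is an assumption about the mechanics: it supposes the limit counts Unicode characters rather than bytes and that excess text is cut from the end, neither of which is confirmed above. `Quality` and `truncate_chars` are hypothetical names.

```rust
/// Hypothetical mirror of the --quality presets.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Quality {
    Low,
    Medium,
    High,
}

impl Quality {
    /// Maximum number of characters sent to the LLM per call.
    fn max_chars(self) -> usize {
        match self {
            Quality::Low => 15_000,
            Quality::Medium => 30_000,
            Quality::High => 60_000,
        }
    }
}

/// Keeps at most `max_chars` characters, always cutting on a char boundary
/// so multi-byte text (for example Japanese) is never split mid-character.
fn truncate_chars(text: &str, max_chars: usize) -> &str {
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}
```

Under these assumptions, a long SDS processed at low or medium loses its tail first, which is where sections 14 to 16 usually sit.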

Batch mode

sds-converter to-json \
  --input-dir ./pdfs/ \
  --output-dir ./json/ \
  --lang ja \
  --concurrency 4

Validation

validate checks structural completeness of the extracted JSON and returns warnings without hard-failing — partial results are still usable.

sds-converter validate --input output.json

Examples of what it checks:

Section 1: no product name (TradeNameJP or TradeNameEN)
Section 1: SupplierInformation missing
Section 2: neither Classification nor HazardLabelling extracted
Section 3: CompositionAndConcentration list is empty

When using the library, convert_to_json returns a (SdsRoot, Vec<String>) tuple — the warnings are surfaced inline.
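
A warning-collecting validator of this kind might look like the sketch below. `SdsSummary` is a hypothetical, flattened stand-in for the real `SdsRoot` type, and `validate` here is an illustration of the pattern rather than the crate's function. The warning strings are the four examples listed above.

```rust
/// Hypothetical, flattened view of the fields the checks look at.
#[derive(Default)]
struct SdsSummary {
    trade_name_jp: Option<String>,
    trade_name_en: Option<String>,
    has_supplier_information: bool,
    has_classification: bool,
    has_hazard_labelling: bool,
    composition_entries: usize,
}

/// Collects warnings instead of returning an error,
/// so a partially extracted document is still usable.
fn validate(sds: &SdsSummary) -> Vec<String> {
    let mut warnings = Vec::new();
    if sds.trade_name_jp.is_none() && sds.trade_name_en.is_none() {
        warnings.push("Section 1: no product name (TradeNameJP or TradeNameEN)".to_string());
    }
    if !sds.has_supplier_information {
        warnings.push("Section 1: SupplierInformation missing".to_string());
    }
    if !sds.has_classification && !sds.has_hazard_labelling {
        warnings.push("Section 2: neither Classification nor HazardLabelling extracted".to_string());
    }
    if sds.composition_entries == 0 {
        warnings.push("Section 3: CompositionAndConcentration list is empty".to_string());
    }
    warnings
}
```

Returning a list rather than failing on the first problem lets a caller decide which warnings are acceptable for its use case.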

Output JSON structure

{
  "Datasheet": {
    "IssueDate": "2024-03-31",
    "SDS-SchemaVersionNo": "1.0"
  },
  "Identification": {
    "TradeProductIdentity": {
      "TradeNameJP": "Sample Product"
    },
    "SupplierInformation": {
      "CompanyName": "Sample Corp",
      "Phone": "03-0000-0000"
    }
  }
}

The full schema covers all 16 JIS Z 7253 sections with ~200 fields. The official spec and developer manual are on the MHLW website (Japanese).

Using as a library

[dependencies]
sds-converter-core = "0.1"

PDF → JSON

use sds_converter_core::{
    converter::{AnthropicBackend, LlmConfig},
    convert_to_json, ConvertConfig, Language,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let backend = AnthropicBackend::new(
        std::env::var("ANTHROPIC_API_KEY")?,
        LlmConfig::default(),
    );
    let config = ConvertConfig {
        source_language: Some(Language::Japanese),
        output_language: Language::Japanese,
        ..Default::default()
    };
    let (sds, warnings) = convert_to_json(
        std::path::Path::new("input.pdf"), &backend, &config
    ).await?;
    for w in &warnings { eprintln!("WARN: {w}"); }
    std::fs::write("output.json", serde_json::to_string_pretty(&sds)?)?;
    Ok(())
}

JSON → Word document

use sds_converter_core::{convert_from_json, ConvertConfig, Language, SdsRoot};

fn main() -> anyhow::Result<()> {
    let sds: SdsRoot = serde_json::from_str(&std::fs::read_to_string("output.json")?)?;
    let config = ConvertConfig {
        output_language: Language::Japanese,
        ..Default::default()
    };
    convert_from_json(&sds, std::path::Path::new("result.docx"), &config)?;
    Ok(())
}

Custom LLM backend

use sds_converter_core::{LlmBackend, SdsError};

struct MyBackend;

impl LlmBackend for MyBackend {
    async fn complete(&self, system: &str, user: &str) -> Result<String, SdsError> {
        // Call your LLM API, return the raw JSON string response
        todo!()
    }
}

Language support

Language	`--lang`	Source standard	Output DOCX headings
Japanese	`ja`	JIS Z 7253	JIS Z 7253
English	`en`	GHS/OSHA HazCom	GHS Rev.10 / ISO 11014
Simplified Chinese	`zh-cn`	GB/T 16483-2012	GB/T 16483-2012
Traditional Chinese	`zh-tw`	CNS 15030	CNS 15030
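
For callers that need to map a `--lang` value to its source standard, the table can be encoded directly. `Lang` below is a hypothetical enum written for this sketch; the crate's own type is `Language`, whose full set of variants is not shown here. Accepting upper-case input is also an assumption.

```rust
use std::str::FromStr;

/// Hypothetical enum mirroring the four rows of the table.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Lang {
    Ja,
    En,
    ZhCn,
    ZhTw,
}

impl FromStr for Lang {
    type Err = String;

    /// Parses a --lang value such as "ja" or "zh-tw".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "ja" => Ok(Lang::Ja),
            "en" => Ok(Lang::En),
            "zh-cn" => Ok(Lang::ZhCn),
            "zh-tw" => Ok(Lang::ZhTw),
            other => Err(format!("unsupported --lang value: {other}")),
        }
    }
}

impl Lang {
    /// Source standard assumed for documents in this language.
    fn source_standard(self) -> &'static str {
        match self {
            Lang::Ja => "JIS Z 7253",
            Lang::En => "GHS/OSHA HazCom",
            Lang::ZhCn => "GB/T 16483-2012",
            Lang::ZhTw => "CNS 15030",
        }
    }
}
```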

Comparison with alternatives

Open-source

	sds-converter	sds_parser	tungsten
Language	Rust	Python	Python
AI/LLM	Yes (pluggable)	No (regex)	No (rule-based)
MHLW JSON	Yes	No	No
Bidirectional	Yes (↔ DOCX)	No	No
Multilingual	ja / en / zh-CN / zh-TW	Limited	English only

Commercial (Japan)

	sds-converter	SDS Meister	SmartSDS	Dr.EHS Chemical
AI	Yes (your API key)	No	Yes (translation)	AI-OCR
MHLW JSON	Yes	Yes	Yes	Yes
PDF → JSON	Yes	No (authoring only)	Partial (JP only)	Yes
Open-source	MIT/Apache-2.0	No	No	No

Of the tools compared here, sds-converter is the only open-source one that supports the MHLW schema and handles the full round-trip, and with --provider local it can run entirely locally.

Crate structure

sds-converter-core — library. LLM extraction, DOCX generation, MHLW schema types.
sds-converter — CLI binary. to-json, to-docx, validate, extract-text subcommands.

Feedback welcome, especially on section 3 component table extraction and non-Japanese document accuracy.

https://github.com/kent-tokyo/sds-converter
