
kent-tokyo


sds-converter: Converting Safety Data Sheets to MHLW Standard JSON with Rust and LLMs

Background

Safety Data Sheets (SDS) are mandatory documents for every chemical product — solvents, adhesives, industrial gases, cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.

The Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.

The problem is that real SDS documents don't arrive structured to this schema.


Why SDS documents are hard to parse

Even two documents both compliant with JIS Z 7253 will differ in ways that break rule-based parsers:

  • Section order: manufacturers arrange the 16 sections freely within the standard
  • Field labeling: the same data appears under different headings across JIS Z 7253, GHS/OSHA HazCom, GB/T 16483, CNS 15030, and company-specific layouts
  • Value representation: "≥99.5%", "99.5% or higher", and "approximately 100%" can all describe the same concentration
  • Language mixing: Japanese SDSs regularly embed English chemical names and CAS numbers mid-sentence
  • Implicit information: section 9 (physical/chemical properties) often has half its fields missing because manufacturers only fill in what's relevant to their product
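
To make the value-representation problem concrete, here is a minimal rule-based sketch in Rust. It is not part of sds-converter; `Concentration` and `parse_concentration` are hypothetical names invented for this illustration, and the rules cover only the three phrasings quoted above.

```rust
/// Hypothetical normalized form of a concentration value.
#[derive(Debug, PartialEq)]
pub enum Concentration {
    AtLeast(f64),
    About(f64),
    Exact(f64),
}

/// Rule-based parser that only knows a handful of English phrasings.
pub fn parse_concentration(raw: &str) -> Option<Concentration> {
    let s = raw.trim();
    // Take the first run of ASCII digits and dots as the numeric value.
    let number: String = s
        .chars()
        .skip_while(|c| !c.is_ascii_digit())
        .take_while(|c| c.is_ascii_digit() || *c == '.')
        .collect();
    let value: f64 = number.parse().ok()?;

    // Each supplier phrasing needs its own rule.
    let concentration = if s.starts_with('≥') || s.starts_with(">=") || s.ends_with("or higher") {
        Concentration::AtLeast(value)
    } else if s.starts_with("approximately") || s.starts_with('~') {
        Concentration::About(value)
    } else {
        Concentration::Exact(value)
    };
    Some(concentration)
}
```

The three phrasings from the list are handled, but a Japanese phrasing such as "99.5%以上" (99.5% or more) falls through to `Exact`, so every new supplier convention needs another rule.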

The MHLW schema compounds this: several of its key names contain typos that must be reproduced exactly. HumanExposureAndEmergencyMeasuress ends in a double s. TestGuidline is missing an e. Desclaimer has an e where the i belongs. These are in the official spec, and validation fails if you "fix" them.
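
One way to keep the official spellings in a single place is a small lookup table. The sketch below is an assumption-laden illustration, not sds-converter's code: `OFFICIAL_KEYS` and `to_official_key` are hypothetical names, and only the three keys quoted above are listed. In a serde-based crate the same effect is usually achieved with `#[serde(rename = "...")]` on each field.

```rust
/// Pairs of (spelling a developer might type, spelling the MHLW spec requires).
/// Only the three keys mentioned in the article are listed here.
const OFFICIAL_KEYS: [(&str, &str); 3] = [
    ("HumanExposureAndEmergencyMeasures", "HumanExposureAndEmergencyMeasuress"),
    ("TestGuideline", "TestGuidline"),
    ("Disclaimer", "Desclaimer"),
];

/// Maps a "corrected" key back to the official spelling.
/// Keys that are not in the table are returned unchanged.
pub fn to_official_key(key: &str) -> &str {
    OFFICIAL_KEYS
        .iter()
        .find(|(corrected, _)| *corrected == key)
        .map(|(_, official)| *official)
        .unwrap_or(key)
}
```

With this table, `to_official_key("Disclaimer")` yields `"Desclaimer"`, while a correctly spelled key such as `"TradeNameJP"` passes through untouched.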

To handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, you'd need separate parsers for each format. Writing and maintaining those is impractical. I built sds-converter to handle this with an LLM instead.


The 16 sections

| # | Schema key | JIS Z 7253 section |
|---|---|---|
| 1 | Identification | Chemical identity and company information |
| 2 | HazardIdentification | Hazard identification |
| 3 | Composition | Composition / information on ingredients |
| 4 | FirstAidMeasures | First-aid measures |
| 5 | FireFightingMeasures | Fire-fighting measures |
| 6 | AccidentalReleaseMeasures | Accidental release measures |
| 7 | HandlingAndStorage | Handling and storage |
| 8 | ExposureControlPersonalProtection | Exposure controls / personal protection |
| 9 | PhysicalChemicalProperties | Physical and chemical properties |
| 10 | StabilityReactivity | Stability and reactivity |
| 11 | ToxicologicalInformation | Toxicological information |
| 12 | EcologicalInformation | Ecological information |
| 13 | DisposalConsiderations | Disposal considerations |
| 14 | TransportInformation | Transport information |
| 15 | RegulatoryInformation | Regulatory information |
| 16 | OtherInformation | Other information |

Installation and quick start

cargo install sds-converter
# PDF → MHLW standard JSON
export ANTHROPIC_API_KEY=sk-ant-...
sds-converter to-json --input input.pdf --output output.json

# MHLW JSON → JIS Z 7253-compliant Word document
sds-converter to-docx --input output.json --output result.docx --lang ja

# Schema validation
sds-converter validate --input output.json

# Extract raw text (no LLM call — useful for debugging)
sds-converter extract-text --input input.pdf

Supported input: PDF, DOCX, XLSX, TXT.
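
As a rough illustration of how a caller might route files by type before extraction, the sketch below maps a file extension to one of the four supported formats. `InputKind` and `input_kind` are hypothetical names; how sds-converter actually detects the format is not described here.

```rust
use std::path::Path;

/// The four input formats listed as supported.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InputKind {
    Pdf,
    Docx,
    Xlsx,
    Txt,
}

/// Guesses the input format from the file extension (case-insensitive).
/// Returns `None` for missing or unsupported extensions.
pub fn input_kind(path: &Path) -> Option<InputKind> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "pdf" => Some(InputKind::Pdf),
        "docx" => Some(InputKind::Docx),
        "xlsx" => Some(InputKind::Xlsx),
        "txt" => Some(InputKind::Txt),
        _ => None,
    }
}
```

Extension checks say nothing about content: a PDF that contains only scanned images still maps to `InputKind::Pdf` even though it has no selectable text.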


How the conversion works

Step 1: Text extraction

Text is pulled from the input file (PDF, DOCX, XLSX, or TXT). Use extract-text to inspect the text that the LLM stage works from, which helps when extraction quality is lower than expected.

Note: Encrypted PDFs and scan-only (image) PDFs are not supported — text extraction requires selectable text.

Step 2: Parallel LLM extraction

The 16 sections are split into two groups and extracted with two parallel LLM calls, which roughly halves per-file latency compared with extracting them sequentially:

  • GROUP_A (sections 1–9): identification, hazard, composition, first aid, fire fighting, accidental release, handling, exposure, physical properties
  • GROUP_B (sections 10–16): stability, toxicology, ecological, disposal, transport, regulatory, other
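
The two-group split can be sketched with standard-library scoped threads. This is an illustration under assumptions, not the crate's implementation: `Group`, `extract_parallel`, and the `extract` callback are hypothetical, and the real tool presumably issues async HTTP calls rather than spawning threads.

```rust
use std::ops::RangeInclusive;
use std::thread;

/// The two section groups described above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Group {
    A,
    B,
}

impl Group {
    /// Section numbers covered by each group.
    pub fn sections(self) -> RangeInclusive<u8> {
        match self {
            Group::A => 1..=9,
            Group::B => 10..=16,
        }
    }
}

/// Runs `extract` once per group on two threads, then merges the
/// (section number, extracted content) pairs in section order.
pub fn extract_parallel<F>(text: &str, extract: F) -> Vec<(u8, String)>
where
    F: Fn(&str, Group) -> Vec<(u8, String)> + Sync,
{
    let (mut merged, group_b) = thread::scope(|s| {
        let a = s.spawn(|| extract(text, Group::A));
        let b = s.spawn(|| extract(text, Group::B));
        (
            a.join().expect("group A worker panicked"),
            b.join().expect("group B worker panicked"),
        )
    });
    merged.extend(group_b);
    merged.sort_by_key(|(section, _)| *section);
    merged
}
```

Because both calls run concurrently, the wall-clock time is governed by the slower of the two groups rather than by their sum.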

Results from both calls are merged. Sections skipped in the first pass are automatically retried. HTTP 429 (rate limit) and 529 (overloaded) responses trigger retries with exponential backoff (2s → 4s → 8s, up to 3 attempts).
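
The retry schedule can be sketched as follows. This is a minimal sketch, not the crate's code: `with_backoff`, `backoff_delay`, and `is_retryable` are hypothetical names, errors are reduced to a bare HTTP status code, and "up to 3 attempts" is read here as three retries after the initial call. The sleep function is passed in so the logic can be exercised without actually waiting.

```rust
use std::time::Duration;

/// Assumed retry budget: three retries after the first call.
const MAX_RETRIES: u32 = 3;

/// Delay before retry number `retry` (0-based): 2s, 4s, 8s.
pub fn backoff_delay(retry: u32) -> Duration {
    Duration::from_secs(2u64 << retry)
}

/// Status codes treated as transient.
pub fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 529)
}

/// Calls `request` until it succeeds, fails with a non-retryable status,
/// or the retry budget is used up. `sleep` is injected by the caller.
pub fn with_backoff<T>(
    mut request: impl FnMut() -> Result<T, u16>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, u16> {
    let mut retry = 0;
    loop {
        match request() {
            Err(status) if is_retryable(status) && retry < MAX_RETRIES => {
                sleep(backoff_delay(retry));
                retry += 1;
            }
            other => return other,
        }
    }
}
```

In production code the `sleep` argument would typically be `std::thread::sleep` or an async timer; a status such as 400 is returned immediately because retrying it cannot help.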

Step 3: JSON output

The merged result is written as MHLW SDS data exchange format v1.0 JSON.


LLM backend and quality settings

Choosing a provider

# OpenAI GPT (gpt-4o-mini by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider openai --api-key $OPENAI_API_KEY

# Google Gemini (gemini-2.0-flash by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider gemini --api-key $GEMINI_API_KEY

# Local LLM via Ollama (any OpenAI-compatible endpoint)
sds-converter to-json --input input.pdf --output output.json \
  --provider local --base-url http://localhost:11434/v1 \
  --model llama3.2 --api-key dummy

Each provider has a default model and reads its key from an environment variable:

| --provider | Default model | Environment variable |
|---|---|---|
| anthropic | claude-haiku-4-5-20251001 (low/medium) · claude-sonnet-4-6 (high) | ANTHROPIC_API_KEY |
| openai | gpt-4o-mini | OPENAI_API_KEY |
| gemini | gemini-2.0-flash | GEMINI_API_KEY |
| mistral | mistral-small-latest | MISTRAL_API_KEY |
| groq | llama-3.3-70b-versatile | GROQ_API_KEY |
| cohere | command-r-plus | COHERE_API_KEY |
| local | llama3 | LOCAL_LLM_API_KEY (optional) |

Quality preset

--quality controls both the model and how much text is sent to the LLM per call:

| --quality | Model (Anthropic) | Max text fed to LLM | Use case |
|---|---|---|---|
| low | claude-haiku-4-5 | 15,000 chars | Speed/cost priority |
| medium (default) | claude-haiku-4-5 | 30,000 chars | Balanced |
| high | claude-sonnet-4-6 | 60,000 chars | Accuracy priority |

At high, the 60,000-character limit is normally enough to send the full document text, including the later sections (transport information, regulatory information) that a shorter limit can cut off. Use --quality high when complete 16-section coverage matters.
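
The effect of the character limit can be sketched like this. `Quality` and `truncate_for_llm` are hypothetical names, and the sketch assumes "chars" means Unicode characters rather than bytes, which matters for Japanese text where one character takes three bytes in UTF-8. Whether the tool counts characters or bytes is not stated.

```rust
/// Quality presets with the limits from the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Quality {
    Low,
    Medium,
    High,
}

impl Quality {
    /// Maximum number of characters sent to the LLM per call.
    pub fn max_chars(self) -> usize {
        match self {
            Quality::Low => 15_000,
            Quality::Medium => 30_000,
            Quality::High => 60_000,
        }
    }
}

/// Returns the longest prefix of `text` that fits the preset's limit,
/// cutting on a character boundary so multi-byte text stays valid UTF-8.
pub fn truncate_for_llm(text: &str, quality: Quality) -> &str {
    match text.char_indices().nth(quality.max_chars()) {
        Some((byte_index, _)) => &text[..byte_index],
        None => text, // already within the limit
    }
}
```

Anything past the cut never reaches the model, which is why sections near the end of a long document are the first to go missing at the lower presets.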

Batch mode

sds-converter to-json \
  --input-dir ./pdfs/ \
  --output-dir ./json/ \
  --lang ja \
  --concurrency 4
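
What a concurrency limit of 4 means in practice can be sketched with a small worker pool. This is an illustration only: `convert_batch` is a hypothetical helper built on standard-library scoped threads, and it makes no claim about how the CLI schedules its work internally.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Applies `convert` to every input using at most `concurrency` worker
/// threads and returns the results in input order.
pub fn convert_batch<T, R, F>(inputs: &[T], concurrency: usize, convert: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    let next = AtomicUsize::new(0);
    let workers = concurrency.clamp(1, inputs.len().max(1));

    let mut results: Vec<(usize, R)> = thread::scope(|s| {
        let mut handles = Vec::new();
        for _ in 0..workers {
            handles.push(s.spawn(|| {
                let mut done = Vec::new();
                loop {
                    // Claim the next unprocessed index.
                    let i = next.fetch_add(1, Ordering::Relaxed);
                    if i >= inputs.len() {
                        break;
                    }
                    done.push((i, convert(&inputs[i])));
                }
                done
            }));
        }
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker panicked"))
            .collect()
    });

    // Workers finish in arbitrary order; restore the input order.
    results.sort_by_key(|(i, _)| *i);
    results.into_iter().map(|(_, r)| r).collect()
}
```

Each worker pulls the next file as soon as it finishes the previous one, so a slow document delays only its own worker.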

Validation

validate checks structural completeness of the extracted JSON and returns warnings without hard-failing — partial results are still usable.

sds-converter validate --input output.json

Examples of what it checks:

  • Section 1: no product name (TradeNameJP or TradeNameEN)
  • Section 1: SupplierInformation missing
  • Section 2: neither Classification nor HazardLabelling extracted
  • Section 3: CompositionAndConcentration list is empty

When using the library, convert_to_json returns a (SdsRoot, Vec<String>) tuple — the warnings are surfaced inline.
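
The warn-but-don't-fail pattern can be sketched as below. `SdsSummary` and `collect_warnings` are hypothetical: the struct is a flattened stand-in for the fields the four example checks look at, not the crate's `SdsRoot` type, and the warning strings are copied from the list above.

```rust
/// Hypothetical flattened view of the fields the example checks inspect.
#[derive(Debug, Default)]
pub struct SdsSummary {
    pub trade_name_jp: Option<String>,
    pub trade_name_en: Option<String>,
    pub has_supplier_information: bool,
    pub has_classification: bool,
    pub has_hazard_labelling: bool,
    pub composition_entries: usize,
}

/// Collects warnings instead of returning an error, so a partially
/// extracted document can still be written out and reviewed.
pub fn collect_warnings(sds: &SdsSummary) -> Vec<String> {
    let mut warnings = Vec::new();
    if sds.trade_name_jp.is_none() && sds.trade_name_en.is_none() {
        warnings.push("Section 1: no product name (TradeNameJP or TradeNameEN)".to_string());
    }
    if !sds.has_supplier_information {
        warnings.push("Section 1: SupplierInformation missing".to_string());
    }
    if !sds.has_classification && !sds.has_hazard_labelling {
        warnings.push("Section 2: neither Classification nor HazardLabelling extracted".to_string());
    }
    if sds.composition_entries == 0 {
        warnings.push("Section 3: CompositionAndConcentration list is empty".to_string());
    }
    warnings
}
```

Returning a list rather than a `Result` lets the caller decide how strict to be, for example by logging the warnings in batch mode and rejecting the file in a stricter pipeline.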


Output JSON structure

{
  "Datasheet": {
    "IssueDate": "2024-03-31",
    "SDS-SchemaVersionNo": "1.0"
  },
  "Identification": {
    "TradeProductIdentity": {
      "TradeNameJP": "Sample Product"
    },
    "SupplierInformation": {
      "CompanyName": "Sample Corp",
      "Phone": "03-0000-0000"
    }
  }
}

The full schema covers all 16 JIS Z 7253 sections with ~200 fields. The official spec and developer manual are on the MHLW website (Japanese).


Using as a library

[dependencies]
sds-converter-core = "0.1"

PDF → JSON

use sds_converter_core::{
    converter::{AnthropicBackend, LlmConfig},
    convert_to_json, ConvertConfig, Language,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let backend = AnthropicBackend::new(
        std::env::var("ANTHROPIC_API_KEY")?,
        LlmConfig::default(),
    );
    let config = ConvertConfig {
        source_language: Some(Language::Japanese),
        output_language: Language::Japanese,
        ..Default::default()
    };
    let (sds, warnings) = convert_to_json(
        std::path::Path::new("input.pdf"), &backend, &config
    ).await?;
    for w in &warnings { eprintln!("WARN: {w}"); }
    std::fs::write("output.json", serde_json::to_string_pretty(&sds)?)?;
    Ok(())
}

JSON → Word document

use sds_converter_core::{convert_from_json, ConvertConfig, Language, SdsRoot};

fn main() -> anyhow::Result<()> {
    let sds: SdsRoot = serde_json::from_str(&std::fs::read_to_string("output.json")?)?;
    let config = ConvertConfig {
        output_language: Language::Japanese,
        ..Default::default()
    };
    convert_from_json(&sds, std::path::Path::new("result.docx"), &config)?;
    Ok(())
}

Custom LLM backend

use sds_converter_core::{LlmBackend, SdsError};

struct MyBackend;

impl LlmBackend for MyBackend {
    async fn complete(&self, system: &str, user: &str) -> Result<String, SdsError> {
        // Call your LLM API, return the raw JSON string response
        todo!()
    }
}

Language support

| Language | --lang | Source standard | Output DOCX headings |
|---|---|---|---|
| Japanese | ja | JIS Z 7253 | JIS Z 7253 |
| English | en | GHS/OSHA HazCom | GHS Rev.10 / ISO 11014 |
| Simplified Chinese | zh-cn | GB/T 16483-2012 | GB/T 16483-2012 |
| Traditional Chinese | zh-tw | CNS 15030 | CNS 15030 |

Comparison with alternatives

Open-source

| Feature | sds-converter | sds_parser | tungsten |
|---|---|---|---|
| Language | Rust | Python | Python |
| AI/LLM | Yes (pluggable) | No (regex) | No (rule-based) |
| MHLW JSON | Yes | No | No |
| Bidirectional | Yes (↔ DOCX) | No | No |
| Multilingual | ja / en / zh-CN / zh-TW | Limited | English only |

Commercial (Japan)

| Feature | sds-converter | SDS Meister | SmartSDS | Dr.EHS Chemical |
|---|---|---|---|---|
| AI | Yes (your API key) | No | Yes (translation) | AI-OCR |
| MHLW JSON | Yes | Yes | Yes | Yes |
| PDF → JSON | Yes | No (authoring only) | Partial (JP only) | Yes |
| Open-source | MIT/Apache-2.0 | No | No | No |

Among the tools compared here, sds-converter is the only open-source one that supports the MHLW schema, can run entirely locally (with --provider local), and handles the full round-trip (PDF → JSON → DOCX).


Crate structure

  • sds-converter-core — library. LLM extraction, DOCX generation, MHLW schema types.
  • sds-converter — CLI binary. to-json, to-docx, validate, extract-text subcommands.

Feedback welcome, especially on section 3 component table extraction and non-Japanese document accuracy.

https://github.com/kent-tokyo/sds-converter
