Background
Safety Data Sheets (SDS) are mandatory documents for every chemical product — solvents, adhesives, industrial gases, cleaning agents. Every manufacturer that supplies a hazardous chemical must provide one. In Japan, the governing standard is JIS Z 7253, which defines 16 sections covering chemical identity, hazard classification, first aid, storage, transport information, and more.
The Ministry of Health, Labour and Welfare (MHLW) published a standard JSON schema in March 2025 for electronic SDS data exchange between chemical management systems. The schema has roughly 200 deeply nested fields covering all 16 sections.
The problem is that real SDS documents don't arrive structured to this schema.
Why SDS documents are hard to parse
Even two documents both compliant with JIS Z 7253 will differ in ways that break rule-based parsers:
- Section order — manufacturers arrange the 16 sections freely within the standard
- Field labeling — the same data appears under different headings across JIS Z 7253, GHS/OSHA HazCom, GB/T 16483, CNS 15030, and company-specific layouts
- Value representation — "≥99.5%", "99.5% or higher", "approximately 100%" all mean the same thing
- Language mixing — Japanese SDS regularly embed English chemical names and CAS numbers mid-sentence
- Implicit information — section 9 (physical/chemical properties) often has half its fields missing because manufacturers only fill in what's relevant to their product
The MHLW schema compounds this: it contains misspelled keys that must be reproduced exactly. HumanExposureAndEmergencyMeasuress ends in a double s. TestGuidline is missing an e. Desclaimer has an e where the first i belongs. These spellings are in the official spec, and validation fails if you "fix" them.
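One way to guard against well-meaning corrections is to map the "fixed" spellings back to the official ones before validation. The following is a minimal sketch, not sds-converter's actual code: the table and the `to_official_key` helper are hypothetical names, and the "corrected" spellings are my assumption about what a developer or model would write.

```rust
/// Pairs of (plausible "corrected" spelling, official MHLW spelling).
/// Only the three keys quoted in this article are listed; the table is
/// illustrative, not a complete inventory of the schema.
const OFFICIAL_SPELLINGS: [(&str, &str); 3] = [
    (
        "HumanExposureAndEmergencyMeasures",
        "HumanExposureAndEmergencyMeasuress",
    ),
    ("TestGuideline", "TestGuidline"),
    ("Disclaimer", "Desclaimer"),
];

/// Returns the official schema key for `key`.
/// Keys that are not in the table are returned unchanged.
fn to_official_key(key: &str) -> &str {
    OFFICIAL_SPELLINGS
        .iter()
        .find(|(corrected, _)| *corrected == key)
        .map(|(_, official)| *official)
        .unwrap_or(key)
}
```

Running every key of an LLM response through a helper like this keeps the output valid even when the model spells the word correctly.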
To handle SDS from international manufacturers (GHS/OSHA format) or Chinese suppliers (GB/T 16483 format) in the same pipeline, you'd need separate parsers for each format. Writing and maintaining those is impractical. I built sds-converter to handle this with an LLM instead.
The 16 sections
| # | Schema key | JIS Z 7253 section |
|---|---|---|
| 1 | Identification | Chemical identity and company information |
| 2 | HazardIdentification | Hazard identification |
| 3 | Composition | Composition / information on ingredients |
| 4 | FirstAidMeasures | First-aid measures |
| 5 | FireFightingMeasures | Fire-fighting measures |
| 6 | AccidentalReleaseMeasures | Accidental release measures |
| 7 | HandlingAndStorage | Handling and storage |
| 8 | ExposureControlPersonalProtection | Exposure controls / personal protection |
| 9 | PhysicalChemicalProperties | Physical and chemical properties |
| 10 | StabilityReactivity | Stability and reactivity |
| 11 | ToxicologicalInformation | Toxicological information |
| 12 | EcologicalInformation | Ecological information |
| 13 | DisposalConsiderations | Disposal considerations |
| 14 | TransportInformation | Transport information |
| 15 | RegulatoryInformation | Regulatory information |
| 16 | OtherInformation | Other information |
Installation and quick start
cargo install sds-converter
# PDF → MHLW standard JSON
export ANTHROPIC_API_KEY=sk-ant-...
sds-converter to-json --input input.pdf --output output.json
# MHLW JSON → JIS Z 7253-compliant Word document
sds-converter to-docx --input output.json --output result.docx --lang ja
# Schema validation
sds-converter validate --input output.json
# Extract raw text (no LLM call — useful for debugging)
sds-converter extract-text --input input.pdf
Supported input: PDF, DOCX, XLSX, TXT.
How the conversion works
Step 1: Text extraction
Text is pulled from the input file (PDF, DOCX, XLSX, or TXT). Use extract-text to inspect exactly what gets sent to the LLM, which helps when extraction quality is lower than expected.
Note: Encrypted PDFs and scan-only (image) PDFs are not supported — text extraction requires selectable text.
Step 2: Parallel LLM extraction
The 16 sections are split into two groups and extracted with two parallel LLM calls, roughly halving per-file latency:
- GROUP_A (sections 1–9): identification, hazard, composition, first aid, fire fighting, accidental release, handling, exposure, physical properties
- GROUP_B (sections 10–16): stability, toxicology, ecological, disposal, transport, regulatory, other
Results from both calls are merged. Sections skipped in the first pass are automatically retried. HTTP 429 (rate limit) and 529 (overloaded) responses trigger retries with exponential backoff (2s → 4s → 8s, up to 3 attempts).
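The parallel calls and the retry schedule can be sketched with the standard library alone. This is an illustration under assumptions, not the crate's implementation: `CallResult`, `backoff_delays`, `call_with_retry`, and `extract_in_parallel` are hypothetical names, the LLM call is modelled as a closure returning either a body or an HTTP status, and "up to 3 attempts" is read as three retries after the initial call.

```rust
use std::thread;
use std::time::Duration;

/// Outcome of one simulated LLM call: a response body, or an HTTP status.
type CallResult = Result<String, u16>;

/// Delays before each retry: base, 2 * base, 4 * base, ...
fn backoff_delays(base: Duration, retries: u32) -> Vec<Duration> {
    (0..retries).map(|i| base * 2u32.pow(i)).collect()
}

/// Statuses that are worth retrying (rate limited or overloaded).
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 529)
}

/// Calls `call` once, then again after each delay for as long as it keeps
/// returning a retryable status. Any other outcome is returned immediately.
fn call_with_retry<F>(mut call: F, delays: &[Duration]) -> CallResult
where
    F: FnMut() -> CallResult,
{
    let mut result = call();
    for delay in delays {
        let retry = matches!(result, Err(status) if is_retryable(status));
        if !retry {
            break;
        }
        thread::sleep(*delay);
        result = call();
    }
    result
}

/// Runs the two group extractions on separate threads, each with its own
/// retry loop, and returns both outcomes for the caller to merge.
fn extract_in_parallel<A, B>(
    group_a: A,
    group_b: B,
    delays: &[Duration],
) -> (CallResult, CallResult)
where
    A: FnMut() -> CallResult + Send,
    B: FnMut() -> CallResult + Send,
{
    thread::scope(|s| {
        let a = s.spawn(|| call_with_retry(group_a, delays));
        let b = s.spawn(|| call_with_retry(group_b, delays));
        (
            a.join().expect("group A thread panicked"),
            b.join().expect("group B thread panicked"),
        )
    })
}
```

With `backoff_delays(Duration::from_secs(2), 3)` the schedule is 2s, 4s, 8s. A real async client would more likely use its runtime's timers and join primitives than OS threads; threads are used here only to keep the sketch dependency-free.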
Step 3: JSON output
The merged result is written as MHLW SDS data exchange format v1.0 JSON.
LLM backend and quality settings
Choosing a provider
# OpenAI GPT (gpt-4o-mini by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider openai --api-key $OPENAI_API_KEY
# Google Gemini (gemini-2.0-flash by default)
sds-converter to-json --input input.pdf --output output.json \
  --provider gemini --api-key $GEMINI_API_KEY
# Local LLM via Ollama (any OpenAI-compatible endpoint)
sds-converter to-json --input input.pdf --output output.json \
  --provider local --base-url http://localhost:11434/v1 \
  --model llama3.2 --api-key dummy
| --provider | Default model | Environment variable |
|---|---|---|
| anthropic | claude-haiku-4-5-20251001 (low/medium) · claude-sonnet-4-6 (high) | ANTHROPIC_API_KEY |
| openai | gpt-4o-mini | OPENAI_API_KEY |
| gemini | gemini-2.0-flash | GEMINI_API_KEY |
| mistral | mistral-small-latest | MISTRAL_API_KEY |
| groq | llama-3.3-70b-versatile | GROQ_API_KEY |
| cohere | command-r-plus | COHERE_API_KEY |
| local | llama3 | LOCAL_LLM_API_KEY (optional) |
Quality preset
--quality controls both the model and how much text is sent to the LLM per call:
| --quality | Model (Anthropic) | Max text fed to LLM | Use case |
|---|---|---|---|
| low | claude-haiku-4-5 | 15,000 chars | Speed/cost priority |
| medium (default) | claude-haiku-4-5 | 30,000 chars | Balanced |
| high | claude-sonnet-4-6 | 60,000 chars | Accuracy priority |
At high, the 60,000-character limit is usually large enough to include the full document text, including the later sections (transport information, regulatory) that a lower limit can cut off. Use --quality high when complete 16-section coverage matters.
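To see why the later sections are the ones at risk, here is a minimal sketch of a per-preset character cap. It assumes the limit is applied by truncating the extracted text from the end and that "chars" means Unicode characters rather than bytes; the `Quality` enum, `max_chars`, and `truncate_chars` are hypothetical names, not the crate's API.

```rust
/// Quality presets from the table above (hypothetical type for illustration).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Quality {
    Low,
    Medium,
    High,
}

/// Maximum number of characters sent to the LLM for each preset.
fn max_chars(quality: Quality) -> usize {
    match quality {
        Quality::Low => 15_000,
        Quality::Medium => 30_000,
        Quality::High => 60_000,
    }
}

/// Keeps at most `limit` characters of `text`.
/// The cut always lands on a character boundary, so multi-byte Japanese or
/// Chinese text is never split in the middle of a character.
fn truncate_chars(text: &str, limit: usize) -> &str {
    match text.char_indices().nth(limit) {
        Some((byte_index, _)) => &text[..byte_index],
        None => text,
    }
}
```

Under this assumption, whatever falls past the cap is dropped, and sections 14 to 16 sit at the end of most documents. Counting characters instead of bytes matters for Japanese SDS, where one character is typically three bytes in UTF-8.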
Batch mode
sds-converter to-json \
  --input-dir ./pdfs/ \
  --output-dir ./json/ \
  --lang ja \
  --concurrency 4
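A flag like --concurrency typically caps how many files are in flight at once. The sketch below shows one way to do that with a fixed pool of worker threads; it is an assumption about the general technique, not a description of how sds-converter schedules its batch, and `run_batch` is a hypothetical helper.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Applies `convert` to every input using at most `concurrency` worker
/// threads (at least one). Results are returned in input order.
fn run_batch<T, R, F>(inputs: &[T], concurrency: usize, convert: F) -> Vec<R>
where
    T: Sync,
    R: Send,
    F: Fn(&T) -> R + Sync,
{
    // Index of the next input to claim, shared by all workers.
    let next = AtomicUsize::new(0);
    // One slot per input so results keep their original position.
    let results: Mutex<Vec<Option<R>>> =
        Mutex::new((0..inputs.len()).map(|_| None).collect());

    thread::scope(|s| {
        for _ in 0..concurrency.max(1) {
            s.spawn(|| loop {
                let i = next.fetch_add(1, Ordering::Relaxed);
                if i >= inputs.len() {
                    break;
                }
                let out = convert(&inputs[i]);
                results.lock().unwrap()[i] = Some(out);
            });
        }
    });

    results
        .into_inner()
        .unwrap()
        .into_iter()
        .map(|slot| slot.expect("every index is claimed by exactly one worker"))
        .collect()
}
```

Each worker claims the next unprocessed index, so a slow file only occupies one worker while the others keep going.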
Validation
validate checks structural completeness of the extracted JSON and returns warnings without hard-failing — partial results are still usable.
sds-converter validate --input output.json
Examples of what it checks:
- Section 1: no product name (TradeNameJP or TradeNameEN)
- Section 1: SupplierInformation missing
- Section 2: neither Classification nor HazardLabelling extracted
- Section 3: CompositionAndConcentration list is empty
When using the library, convert_to_json returns a (SdsRoot, Vec<String>) tuple — the warnings are surfaced inline.
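The warn-but-don't-fail approach can be sketched as a function that collects messages instead of returning an error. This is a simplified illustration: `SdsSummary` is a hypothetical flattened stand-in for the real `SdsRoot` type, and `collect_warnings` covers only the four example checks listed above.

```rust
/// Hypothetical, flattened view of the fields the example checks look at.
#[derive(Default)]
struct SdsSummary {
    trade_name_jp: Option<String>,
    trade_name_en: Option<String>,
    has_supplier_information: bool,
    has_classification: bool,
    has_hazard_labelling: bool,
    composition_rows: usize,
}

/// Returns one warning per failed check. An empty list means all four
/// checks passed; a non-empty list does not invalidate the result.
fn collect_warnings(sds: &SdsSummary) -> Vec<String> {
    let mut warnings = Vec::new();
    if sds.trade_name_jp.is_none() && sds.trade_name_en.is_none() {
        warnings.push("Section 1: no product name (TradeNameJP or TradeNameEN)".to_string());
    }
    if !sds.has_supplier_information {
        warnings.push("Section 1: SupplierInformation missing".to_string());
    }
    if !sds.has_classification && !sds.has_hazard_labelling {
        warnings.push(
            "Section 2: neither Classification nor HazardLabelling extracted".to_string(),
        );
    }
    if sds.composition_rows == 0 {
        warnings.push("Section 3: CompositionAndConcentration list is empty".to_string());
    }
    warnings
}
```

Because the function never returns an error, a caller can still write out a partial result and decide for itself which warnings are serious enough to flag for manual review.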
Output JSON structure
{
  "Datasheet": {
    "IssueDate": "2024-03-31",
    "SDS-SchemaVersionNo": "1.0"
  },
  "Identification": {
    "TradeProductIdentity": {
      "TradeNameJP": "Sample Product"
    },
    "SupplierInformation": {
      "CompanyName": "Sample Corp",
      "Phone": "03-0000-0000"
    }
  }
}
The full schema covers all 16 JIS Z 7253 sections with ~200 fields. The official spec and developer manual are on the MHLW website (Japanese).
Using as a library
[dependencies]
sds-converter-core = "0.1"
PDF → JSON
use sds_converter_core::{
    converter::{AnthropicBackend, LlmConfig},
    convert_to_json, ConvertConfig, Language,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let backend = AnthropicBackend::new(
        std::env::var("ANTHROPIC_API_KEY")?,
        LlmConfig::default(),
    );
    let config = ConvertConfig {
        source_language: Some(Language::Japanese),
        output_language: Language::Japanese,
        ..Default::default()
    };
    let (sds, warnings) = convert_to_json(
        std::path::Path::new("input.pdf"), &backend, &config
    ).await?;
    for w in &warnings { eprintln!("WARN: {w}"); }
    std::fs::write("output.json", serde_json::to_string_pretty(&sds)?)?;
    Ok(())
}
JSON → Word document
use sds_converter_core::{convert_from_json, ConvertConfig, Language, SdsRoot};

fn main() -> anyhow::Result<()> {
    let sds: SdsRoot = serde_json::from_str(&std::fs::read_to_string("output.json")?)?;
    let config = ConvertConfig {
        output_language: Language::Japanese,
        ..Default::default()
    };
    convert_from_json(&sds, std::path::Path::new("result.docx"), &config)?;
    Ok(())
}
Custom LLM backend
use sds_converter_core::{LlmBackend, SdsError};

struct MyBackend;

impl LlmBackend for MyBackend {
    async fn complete(&self, system: &str, user: &str) -> Result<String, SdsError> {
        // Call your LLM API, return the raw JSON string response
        todo!()
    }
}
Language support
| Language | --lang | Source standard | Output DOCX headings |
|---|---|---|---|
| Japanese | ja | JIS Z 7253 | JIS Z 7253 |
| English | en | GHS/OSHA HazCom | GHS Rev.10 / ISO 11014 |
| Simplified Chinese | zh-cn | GB/T 16483-2012 | GB/T 16483-2012 |
| Traditional Chinese | zh-tw | CNS 15030 | CNS 15030 |
Comparison with alternatives
Open-source
| | sds-converter | sds_parser | tungsten |
|---|---|---|---|
| Language | Rust | Python | Python |
| AI/LLM | Yes (pluggable) | No (regex) | No (rule-based) |
| MHLW JSON | Yes | No | No |
| Bidirectional | Yes (↔ DOCX) | No | No |
| Multilingual | ja / en / zh-CN / zh-TW | Limited | English only |
Commercial (Japan)
| | sds-converter | SDS Meister | SmartSDS | Dr.EHS Chemical |
|---|---|---|---|---|
| AI | Yes (your API key) | No | Yes (translation) | AI-OCR |
| MHLW JSON | Yes | Yes | Yes | Yes |
| PDF → JSON | Yes | No (authoring only) | Partial (JP only) | Yes |
| Open-source | MIT/Apache-2.0 | No | No | No |
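Among the tools compared here, sds-converter is the only open-source one that supports the MHLW schema, can run entirely locally (with --provider local), and handles the full round-trip from source document to JSON and back to DOCX.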
Crate structure
- sds-converter-core: library. LLM extraction, DOCX generation, MHLW schema types.
- sds-converter: CLI binary. to-json, to-docx, validate, extract-text subcommands.
Feedback welcome, especially on section 3 component table extraction and non-Japanese document accuracy.