Building a Document Management Pipeline for Multilingual Safety Data Sheets
If you're developing systems for chemical manufacturers or exporters, you've probably encountered Safety Data Sheets (SDS). These aren't just documents—they're regulatory compliance requirements that can make or break international trade deals. A recent article on SDS translation requirements highlights the complexity of managing these across multiple markets, but what does this mean from a technical implementation perspective?
Let's build a document management pipeline that handles the unique challenges of multilingual SDS workflows.
The Technical Challenge
SDS documents follow a rigid 16-section structure mandated by regulations like REACH (EU), OSHA HazCom (US), and GB CLP (Great Britain — Northern Ireland continues to follow EU CLP). Unlike marketing content, these documents contain:
- Standardized hazard codes (H-statements) and precautionary statements (P-statements)
- Chemical identifiers (CAS numbers, EINECS numbers)
- Regulatory data that varies by jurisdiction
- Structured data that feeds into logistics systems
The pipeline needs to handle versioning, ensure data consistency across languages, and maintain an audit trail for regulatory compliance.
Architecture Overview
Here's a system design that addresses the core requirements:
# Core data models
import re
import uuid
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional
class SDSSection(Enum):
    """The 16 sections mandated for every SDS by GHS-aligned regulations
    (e.g. EU REACH Annex II, US OSHA HazCom 2012).

    Member values match the official section numbers, so ``section.value``
    can be used directly when rendering or cross-referencing sections.
    """
    PRODUCT_ID = 1             # Identification of the substance/mixture and supplier
    HAZARD_ID = 2              # Hazard identification (H-statements are declared here)
    COMPOSITION = 3            # Composition / information on ingredients (CAS numbers)
    FIRST_AID = 4              # First-aid measures
    FIREFIGHTING = 5           # Fire-fighting measures
    ACCIDENTAL_RELEASE = 6     # Accidental release measures
    HANDLING_STORAGE = 7       # Handling and storage
    EXPOSURE_CONTROLS = 8      # Exposure controls / personal protection
    PHYSICAL_CHEMICAL = 9      # Physical and chemical properties
    STABILITY_REACTIVITY = 10  # Stability and reactivity
    TOXICOLOGICAL = 11         # Toxicological information
    ECOLOGICAL = 12            # Ecological information
    DISPOSAL = 13              # Disposal considerations
    TRANSPORT = 14             # Transport information
    REGULATORY = 15            # Regulatory information
    OTHER = 16                 # Other information (revision history, disclaimers, ...)
@dataclass
class SDSDocument:
    """One language edition of a Safety Data Sheet.

    NOTE(review): inferred from usage elsewhere in this file — each
    translation appears to be a separate SDSDocument sharing the same
    ``product_id``; confirm against the PLM integration.
    """
    id: str                          # unique document identifier
    product_id: str                  # links all language editions of one product
    language: str                    # language of this edition (code format unspecified — confirm)
    version: str                     # document version label (see SDSVersionManager)
    sections: Dict[SDSSection, str]  # section content keyed by the 16 mandated sections
    regulatory_codes: List[str]      # H/P-statements, CAS numbers, etc. for this product
    last_updated: str                # last-change timestamp, stored as a string
    compliance_status: str           # current compliance/workflow state
Handling Regulatory Data
The trickiest part is managing regulatory codes that change meaning across jurisdictions. Here's how to structure this:
class RegulatoryCodeManager:
    """Resolves standardized hazard/precautionary codes to jurisdiction-specific,
    localized statement text.
    """

    def __init__(self):
        # Minimal in-memory seed data; production would load these from a
        # regulatory database or API (see get_localized_code).
        self.code_mappings = {
            'EU': {
                'H315': 'Causes skin irritation',
                'P302+P352': 'IF ON SKIN: Wash with plenty of water'
            },
            'US': {
                'H315': 'Causes skin irritation',
                'P302+P352': 'IF ON SKIN: Wash with plenty of soap and water'
            }
        }

    def get_localized_code(self, code: str, jurisdiction: str, language: str) -> str:
        """Return the localized statement text for *code* in *jurisdiction*.

        Falls back to returning *code* unchanged when the jurisdiction or
        code is unknown, so callers always get a printable string.
        """
        # This would typically hit a database or API.
        base_text = self.code_mappings.get(jurisdiction, {}).get(code)
        if base_text:
            return self.translate_regulatory_text(base_text, language)
        return code

    def translate_regulatory_text(self, text: str, target_lang: str) -> str:
        """Translate regulatory statement text into *target_lang*.

        Integration point with translation memory systems — regulatory
        codes have standardized translations.

        BUG FIX: the original body was ``pass``, which returned ``None`` and
        made get_localized_code return ``None`` for every *known* code.
        Until the TMS integration lands, return the source text unchanged.
        """
        # TODO: wire up the translation-memory lookup here.
        return text
Version Control and Change Tracking
SDS documents need rigorous version control. Regulators may ask to see the exact version that was current on a specific date:
import json
from datetime import datetime
class SDSVersionManager:
    """Stores immutable, timestamped snapshots of SDS documents so the exact
    version current on any given date can be produced for regulators.
    """

    def __init__(self, storage_backend):
        # storage_backend is expected to expose save(path, data) and
        # list(prefix) — confirm against the concrete backend.
        self.storage = storage_backend

    def create_version(self, document: SDSDocument, changes: Dict) -> str:
        """Snapshot *document*, record *changes*, and return the new version id.

        BUG FIXES vs. the original:
        * ``datetime.now()`` was called twice, so ``version_id`` and
          ``timestamp`` could disagree across a second boundary.
        * ``json.dumps(document.__dict__)`` raised ``TypeError`` because
          ``sections`` is keyed by ``SDSSection`` enum members, which are
          not JSON-serializable dict keys.
        """
        now = datetime.now()  # single timestamp so id and 'timestamp' agree
        version_id = f"v{now.strftime('%Y%m%d_%H%M%S')}"
        # Store the complete document state, with enum keys flattened to
        # their official section numbers for JSON.
        doc_state = dict(document.__dict__)
        doc_state['sections'] = {
            section.value: text for section, text in document.sections.items()
        }
        version_data = {
            'document_id': document.id,
            'version_id': version_id,
            'timestamp': now.isoformat(),
            'changes': changes,
            'complete_document': doc_state
        }
        self.storage.save(f"sds_versions/{document.id}/{version_id}.json",
                          json.dumps(version_data))
        return version_id

    def get_version_at_date(self, document_id: str, target_date: datetime) -> Optional[SDSDocument]:
        """Return the version that was current at *target_date*, or None."""
        versions = self.storage.list(f"sds_versions/{document_id}/")
        valid_versions = [
            v for v in versions
            if datetime.fromisoformat(v['timestamp']) <= target_date
        ]
        if not valid_versions:
            return None
        # ISO-8601 timestamps sort chronologically as strings.
        latest = max(valid_versions, key=lambda x: x['timestamp'])
        doc_state = dict(latest['complete_document'])
        # Rebuild the enum-keyed sections dict from the stored numbers
        # (JSON turns the integer keys into strings).
        doc_state['sections'] = {
            SDSSection(int(num)): text for num, text in doc_state['sections'].items()
        }
        return SDSDocument(**doc_state)
Translation Workflow Integration
The system needs to track translation status and handle the multi-stage review process:
class TranslationWorkflow:
    """Submits SDS content to a translation management system (TMS) and
    routes it through a fixed translate -> review -> QA pipeline.
    """

    # Identifier patterns that must be locked (never translated).
    # Compiled once at class creation instead of importing/recompiling
    # inside extract_regulatory_codes on every call.
    _H_CODE_RE = re.compile(r'H\d{3}(?:\+H\d{3})*')  # H-statements, incl. combined (H302+H332)
    _P_CODE_RE = re.compile(r'P\d{3}(?:\+P\d{3})*')  # P-statements, incl. combined
    _CAS_RE = re.compile(r'\d{2,7}-\d{2}-\d')        # CAS registry numbers (e.g. 64-17-5)

    def __init__(self, translation_api):
        self.api = translation_api
        # Mandatory multi-stage review for regulatory content.
        self.workflow_stages = ['translation', 'review', 'qa_review']

    def submit_for_translation(self, source_doc: SDSDocument, target_languages: List[str]) -> str:
        """Create a TMS job for *source_doc* and return the new job id."""
        job_id = str(uuid.uuid4())
        # Extract translatable content while preserving structure.
        translatable_segments = self.extract_segments(source_doc)
        # Submit to translation management system.
        self.api.create_job(
            job_id=job_id,
            source_language=source_doc.language,
            target_languages=target_languages,
            segments=translatable_segments,
            domain='chemical_safety',
            workflow_stages=self.workflow_stages
        )
        return job_id

    def extract_segments(self, doc: SDSDocument) -> List[Dict]:
        """Turn each SDS section into one TMS segment with locked terms."""
        segments = []
        for section, content in doc.sections.items():
            segments.append({
                'id': f"{doc.id}_{section.value}",
                'source_text': content,
                'context': f"SDS Section {section.value}",
                'locked_terms': self.extract_regulatory_codes(content)
            })
        return segments

    def extract_regulatory_codes(self, content: str) -> List[str]:
        """Return H-codes, P-codes, and CAS numbers found in *content*.

        These identifiers are standardized across languages and must not
        be translated, so they are passed to the TMS as locked terms.
        """
        codes = list(self._H_CODE_RE.findall(content))
        codes.extend(self._P_CODE_RE.findall(content))
        codes.extend(self._CAS_RE.findall(content))
        return codes
Quality Assurance Automation
Automated checks can catch common errors before human review:
class SDSQualityChecker:
    """Runs automated compliance checks on an SDS document before human review.

    BUG FIX: the original ``check_consistency`` called ``re.findall`` but
    ``re`` was only imported locally inside *other* methods, so the first
    consistency check raised ``NameError``. ``re`` is now imported at
    module level and used uniformly.
    """

    def __init__(self):
        # All 16 mandated sections must be present and non-empty.
        self.required_sections = list(SDSSection)
        # Raw patterns kept as data for reporting/configuration reuse.
        self.regulatory_patterns = {
            'h_codes': r'H\d{3}',
            'p_codes': r'P\d{3}',
            'cas_numbers': r'\d{2,7}-\d{2}-\d'
        }

    def validate_document(self, doc: SDSDocument) -> List[str]:
        """Return a list of human-readable validation errors (empty when clean)."""
        errors = []
        # Check all 16 sections are present and not blank.
        missing_sections = [
            section for section in self.required_sections
            if section not in doc.sections or not doc.sections[section].strip()
        ]
        if missing_sections:
            errors.append(f"Missing sections: {missing_sections}")
        # Validate regulatory codes are properly formatted.
        for section, content in doc.sections.items():
            errors.extend(self.check_regulatory_codes(content, section))
        # Cross-reference hazard codes against Section 2.
        errors.extend(self.check_consistency(doc))
        return errors

    def check_regulatory_codes(self, content: str, section: SDSSection) -> List[str]:
        """Flag H-codes that are not 'H' followed by exactly three digits."""
        errors = []
        for code in re.findall(r'H\d+', content):
            if not re.fullmatch(r'H\d{3}', code):
                errors.append(f"Invalid H-code format in section {section.value}: {code}")
        return errors

    def check_consistency(self, doc: SDSDocument) -> List[str]:
        """Check that H-codes used in any section are declared in Section 2
        (hazard identification)."""
        errors = []
        section_2_codes = set(re.findall(r'H\d{3}', doc.sections.get(SDSSection.HAZARD_ID, '')))
        for section, content in doc.sections.items():
            if section is SDSSection.HAZARD_ID:
                continue
            referenced_codes = set(re.findall(r'H\d{3}', content))
            invalid_refs = referenced_codes - section_2_codes
            if invalid_refs:
                errors.append(f"Section {section.value} references codes not in hazard identification: {invalid_refs}")
        return errors
Integration Points
This system needs to integrate with:
- PLM/ERP systems for product data synchronization
- Translation management platforms (Phrase, Lokalise, etc.)
- Regulatory databases for up-to-date code definitions
- Document generation tools for final PDF output
- Audit logging systems for compliance tracking
Deployment Considerations
For production deployment:
- Use event-driven architecture to handle document updates across languages
- Implement proper backup and disaster recovery for regulatory compliance
- Set up monitoring for translation workflow bottlenecks
- Consider geographic distribution based on your target markets
Managing multilingual SDS documents is complex, but with the right technical foundation, you can build a system that ensures compliance while streamlining operations for your chemical industry clients.
The key is treating these documents as structured data with regulatory constraints, not just text to be translated.
Top comments (0)