Diogo Heleno

Posted on • Originally published at m21global.com

Building a Multilingual Documentation Pipeline for Product Compliance

If you're working for a company that ships physical products to Europe, you've probably heard about CE marking requirements. What you might not know is how much of a technical challenge the documentation pipeline becomes when you need to maintain synchronized translations across multiple languages and document types.

A recent article on CE marking documentation requirements got me thinking about how developers can build systems to handle this complexity. While compliance teams worry about regulatory requirements, we need to solve the technical problems: version control across languages, automated workflows, and maintaining consistency at scale.

The Technical Challenge

Let's break down what we're actually dealing with:

  • Multiple document types: Instructions for use, safety manuals, declarations of conformity, data sheets
  • Multiple target languages: the official language(s) of every EU market where you sell
  • Version synchronization: When the source document updates, all translations need to update
  • Consistency requirements: Technical terminology must be identical across all documents in each language
  • Audit trails: Regulatory authorities can request documentation history

This isn't a simple "translate and forget" problem. It's a content management system with compliance constraints.

Architecture Overview

Here's a basic pipeline architecture that handles the core requirements:

# docker-compose.yml
services:
  docs-api:
    build: ./api
    depends_on:
      - db
    environment:
      - DATABASE_URL=postgresql://user:pass@db:5432/compliance_docs
      - TRANSLATION_API_KEY=${TRANSLATION_SERVICE_KEY}

  document-processor:
    build: ./processor
    volumes:
      - ./documents:/app/documents
      - ./output:/app/output

  version-tracker:
    build: ./tracker
    depends_on:
      - db

  db:
    image: postgres:15
    environment:
      - POSTGRES_DB=compliance_docs
      # Credentials must match the DATABASE_URL above
      - POSTGRES_USER=user
      - POSTGRES_PASSWORD=pass
    volumes:
      # Persist the database across container restarts
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:

Document Version Control

The foundation is tracking document versions and their translation status. Here's a basic schema:

-- Source documents
CREATE TABLE source_documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_type VARCHAR(50) NOT NULL, -- 'manual', 'safety_sheet', etc
    product_id VARCHAR(100) NOT NULL,
    version VARCHAR(20) NOT NULL,
    content_hash VARCHAR(64) NOT NULL,
    file_path TEXT NOT NULL,
    created_at TIMESTAMP DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'draft'
);

-- Translation jobs
CREATE TABLE translations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_document_id UUID REFERENCES source_documents(id),
    target_language VARCHAR(5) NOT NULL, -- 'de-DE', 'fr-FR', etc
    translation_status VARCHAR(20) DEFAULT 'pending',
    translator_type VARCHAR(20) NOT NULL, -- 'human', 'machine', 'hybrid'
    completed_at TIMESTAMP,
    file_path TEXT,
    quality_score DECIMAL(3,2)
);

-- Terminology glossary
CREATE TABLE glossary_terms (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_term TEXT NOT NULL,
    target_language VARCHAR(5) NOT NULL,
    target_term TEXT NOT NULL,
    product_category VARCHAR(50),
    approved_by VARCHAR(100),
    created_at TIMESTAMP DEFAULT NOW()
);
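
With the hashes and status columns in place, finding coverage gaps becomes a query. Here's a sketch that lists source documents still missing a completed translation; it assumes an 'approved' value in source_documents.status (the schema above only defines the 'draft' default):

-- For each approved source document, list translation jobs
-- that haven't reached 'completed' yet
SELECT sd.product_id,
       sd.document_type,
       sd.version,
       t.target_language,
       t.translation_status
FROM source_documents sd
JOIN translations t ON t.source_document_id = sd.id
WHERE sd.status = 'approved'
  AND t.translation_status <> 'completed'
ORDER BY sd.product_id, sd.document_type, t.target_language;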

Automated Translation Workflow

Here's a Python class that handles the core workflow:

import hashlib
import re
from pathlib import Path
from typing import Dict

class DocumentProcessor:
    def __init__(self, db_connection, translation_service):
        self.db = db_connection
        self.translation_service = translation_service

    def process_document_update(self, file_path: str, product_id: str, doc_type: str):
        """Process a source document update and trigger translations"""

        # Calculate content hash
        content_hash = self._calculate_hash(file_path)

        # Check if this version already exists
        existing = self._get_document_by_hash(content_hash, product_id, doc_type)
        if existing:
            print(f"Document {file_path} unchanged, skipping")
            return existing['id']

        # Create new document version
        doc_id = self._create_document_record(file_path, product_id, doc_type, content_hash)

        # Extract translatable content
        content = self._extract_content(file_path, doc_type)

        # Get required languages for this product
        target_languages = self._get_target_languages(product_id)

        # Queue translation jobs
        for lang in target_languages:
            self._queue_translation(doc_id, lang, content)

        return doc_id

    def _calculate_hash(self, file_path: str) -> str:
        """SHA-256 of the file contents: 64 hex chars, matching the content_hash column"""
        return hashlib.sha256(Path(file_path).read_bytes()).hexdigest()

    def _extract_content(self, file_path: str, doc_type: str) -> Dict:
        """Extract translatable content based on document type"""

        if doc_type == 'manual':
            # For technical manuals, extract structured content
            return self._extract_manual_content(file_path)
        elif doc_type == 'safety_sheet':
            # Safety sheets have specific regulatory sections
            return self._extract_safety_sheet_content(file_path)
        else:
            # Generic document processing
            return self._extract_generic_content(file_path)

    def _apply_glossary(self, content: str, target_language: str) -> str:
        """Apply approved terminology from glossary"""

        terms = self._get_glossary_terms(target_language)

        for source_term, target_term in terms.items():
            # Use word boundaries to avoid partial matches
            pattern = rf'\b{re.escape(source_term)}\b'
            content = re.sub(pattern, target_term, content, flags=re.IGNORECASE)

        return content

    # The database helpers (_get_document_by_hash, _create_document_record,
    # _get_target_languages, _queue_translation, _get_glossary_terms) and the
    # per-format extractors are left out for brevity.
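
Wiring it up is one call per event. A hypothetical invocation (the file path, product ID, and client objects are placeholders):

processor = DocumentProcessor(db_connection, translation_service)

# Called by a file watcher or CI hook whenever a source document changes
doc_id = processor.process_document_update(
    file_path='documents/widget-3000/manual_v2.docx',
    product_id='widget-3000',
    doc_type='manual',
)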

Integration with Translation Services

For production systems, you'll want to integrate with professional translation APIs:

class TranslationOrchestrator:
    def __init__(self):
        # Thin wrappers around whichever MT provider and translation
        # vendor platform you integrate with
        self.machine_translation = MachineTranslationAPI()
        self.human_translation = HumanTranslationAPI()

    async def translate_document(self, doc_id: str, target_lang: str, content: Dict):
        """Route translation based on content type and compliance requirements"""

        doc_type = content['document_type']

        if self._requires_certified_translation(doc_type):
            # Safety-critical content needs human translators
            return await self._human_translation_workflow(doc_id, target_lang, content)
        else:
            # Non-critical content can use machine translation + review
            return await self._hybrid_translation_workflow(doc_id, target_lang, content)

    def _requires_certified_translation(self, doc_type: str) -> bool:
        """Determine if document type requires certified translation"""
        critical_types = ['safety_manual', 'medical_device_ifu', 'declaration_of_conformity']
        return doc_type in critical_types
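
The workflow methods are where the real integration work lives. Here's a minimal sketch of the hybrid path, continuing the class above; the provider's translate call, the content['sections'] shape, and the save_translation_draft / create_review_job helpers are all assumptions, not a real vendor API:

    async def _hybrid_translation_workflow(self, doc_id: str, target_lang: str, content: Dict):
        """Machine-translate each section, then queue the draft for human review."""
        translated = {}
        for section_id, text in content['sections'].items():
            # Substitute your MT provider's actual API call here
            translated[section_id] = await self.machine_translation.translate(
                text, target_lang=target_lang
            )

        # A translation only becomes 'completed' after a reviewer signs off
        draft_id = await save_translation_draft(doc_id, target_lang, translated)
        await create_review_job(draft_id, target_lang)
        return draft_id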

Monitoring and Compliance Tracking

You need visibility into the translation pipeline status:

# Simple FastAPI endpoint for pipeline status.
# get_product_documents, get_target_markets and get_translation_status
# are async database queries against the schema above, elided here.
from fastapi import FastAPI

app = FastAPI()

@app.get("/products/{product_id}/compliance-status")
async def get_compliance_status(product_id: str):
    """Get translation status for all documents for a product"""

    documents = await get_product_documents(product_id)
    target_markets = await get_target_markets(product_id)

    status = {
        'product_id': product_id,
        'target_markets': target_markets,
        'documents': []
    }

    for doc in documents:
        doc_status = {
            'type': doc['document_type'],
            'version': doc['version'],
            'translations': {}
        }

        for market in target_markets:
            lang = market['language']
            translation = await get_translation_status(doc['id'], lang)
            doc_status['translations'][lang] = {
                'status': translation['status'],
                'completed_at': translation['completed_at'],
                'compliance_ready': translation['status'] == 'completed'
            }

        status['documents'].append(doc_status)

    return status
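
Querying it from a script (with a made-up product ID) gives compliance teams a quick gap report:

import requests

resp = requests.get('http://localhost:8000/products/widget-3000/compliance-status')
for doc in resp.json()['documents']:
    missing = [lang for lang, t in doc['translations'].items()
               if not t['compliance_ready']]
    if missing:
        print(f"{doc['type']} v{doc['version']}: missing {', '.join(missing)}")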

Key Implementation Tips

Start with file watching: Use tools like watchdog in Python to automatically detect when source documents change.
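
A minimal watcher using watchdog's actual Observer/FileSystemEventHandler API, reusing the processor instance from earlier; the documents/<product_id>/... directory layout and the hard-coded doc_type are assumptions for the sketch:

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class SourceDocumentHandler(FileSystemEventHandler):
    def __init__(self, processor):
        self.processor = processor

    def on_modified(self, event):
        if event.is_directory:
            return
        # Duplicate events are cheap: the content hash check in
        # process_document_update skips unchanged files
        product_id = event.src_path.split('/')[-2]
        self.processor.process_document_update(event.src_path, product_id, doc_type='manual')

observer = Observer()
observer.schedule(SourceDocumentHandler(processor), 'documents/', recursive=True)
observer.start()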

Implement proper queuing: Use Redis or RabbitMQ for translation job queues. Large documents take time to process.
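
With redis-py, a list plus blocking pops gets you surprisingly far (the queue name and job shape are my own, and process_translation_job is a hypothetical handler):

import json
import redis

r = redis.Redis(host='localhost', port=6379)

# Producer side -- what _queue_translation could do under the hood
def enqueue_translation(doc_id: str, target_lang: str):
    r.lpush('translation_jobs', json.dumps({'doc_id': doc_id, 'lang': target_lang}))

# Worker loop -- brpop blocks until a job arrives
def translation_worker():
    while True:
        _, raw = r.brpop('translation_jobs')
        job = json.loads(raw)
        process_translation_job(job['doc_id'], job['lang'])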

Version everything: Keep old versions of both source and translated documents. Regulatory audits can ask for historical documentation.

Test with real documents: Compliance documentation has specific formatting requirements that break standard document processing tools.

Plan for partial updates: When only one section of a manual changes, you don't want to retranslate the entire document.
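
Per-section hashing is one way to get there: hash each section instead of the whole file, and only re-queue the sections whose hash changed. A sketch as an extra DocumentProcessor method, assuming content is already split into a section_id -> text mapping and that stored hashes come from a hypothetical _get_section_hashes helper:

    def changed_sections(self, doc_id: str, sections: Dict[str, str]) -> Dict[str, str]:
        """Return only the sections whose content hash differs from the stored version."""
        previous = self._get_section_hashes(doc_id)  # {section_id: sha256 hex}
        changed = {}
        for section_id, text in sections.items():
            digest = hashlib.sha256(text.encode('utf-8')).hexdigest()
            if previous.get(section_id) != digest:
                changed[section_id] = text
        return changed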

Next Steps

This pipeline handles the basics, but production systems need additional features:

  • Integration with CAT (Computer-Assisted Translation) tools
  • Automated quality checks for translated content
  • Workflow management for human translator assignments
  • Integration with product lifecycle management systems
  • Automated compliance status reporting

The regulatory requirements drive the technical architecture more than you might expect. But once you understand the constraints, building a robust multilingual documentation system becomes a well-defined engineering problem.

Building compliance documentation pipelines isn't glamorous work, but it's the kind of system that keeps products shipping and lawyers happy. And in my experience, those tend to be pretty important business requirements.
