Diogo Heleno

Posted on May 19 • Originally published at m21global.com

Building a Terminology Management System for Technical Documentation — A Developer's Guide

#i18n #webdev #productivity #tutorial

If you've ever worked on internationalization for technical products, you know the pain: "motor" becomes "drive unit" halfway through the docs, safety instructions contradict themselves across languages, and your support team fields tickets about confusing terminology.

A recent deep dive into industrial documentation translation challenges got me thinking about the technical infrastructure needed to prevent these issues. While translation agencies solve this with human processes, developers can build systems that enforce consistency from the ground up.

The Technical Problem with Terminology

Terminology inconsistency isn't just a translation problem — it's a data integrity problem. When your API documentation uses "authentication" in one endpoint and "authorization" in another for the same concept, you're creating the same confusion that industrial translators face with "motor" vs "drive unit."

The core issue: terminology decisions get made in silos without a single source of truth.

Database Schema for Terminology Management

Here's a practical schema for tracking terminology decisions across your technical documentation:

CREATE TABLE terminology (
  id SERIAL PRIMARY KEY,
  source_term VARCHAR(255) NOT NULL,
  target_term VARCHAR(255) NOT NULL,
  source_language CHAR(2) NOT NULL,
  target_language CHAR(2) NOT NULL,
  definition TEXT,
  context VARCHAR(255), -- API, UI, documentation
  status VARCHAR(20) DEFAULT 'approved', -- draft, approved, deprecated
  excluded_terms TEXT[], -- variants to avoid
  source_standard VARCHAR(100), -- ISO standard, style guide, etc
  created_at TIMESTAMP DEFAULT NOW(),
  updated_at TIMESTAMP DEFAULT NOW()
);

CREATE TABLE term_usage (
  id SERIAL PRIMARY KEY,
  terminology_id INTEGER REFERENCES terminology(id),
  document_path VARCHAR(500),
  line_number INTEGER,
  context_snippet TEXT,
  verified_at TIMESTAMP
);

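This structure tracks not just which terms to use, but where each one actually appears and when that usage was last verified. Note that updated_at only receives its default on insert; keeping it current on later updates needs a trigger or application code.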
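As one way to put the verified_at column to work, the sketch below flags usages that were never verified or whose last verification is older than a cutoff. It operates on plain dictionaries shaped like term_usage rows rather than a live database; the function name find_stale_usages and the 90-day default are assumptions for illustration, not part of the schema.

```python
from datetime import datetime, timedelta


def find_stale_usages(usages, now, max_age_days=90):
    """Return usage rows that were never verified or verified too long ago.

    `usages` is a list of dicts carrying the term_usage columns used here:
    document_path, line_number and verified_at (a datetime or None).
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        row for row in usages
        if row["verified_at"] is None or row["verified_at"] < cutoff
    ]


if __name__ == "__main__":
    # Hypothetical rows, as they might come back from a SELECT on term_usage
    rows = [
        {"document_path": "docs/install.md", "line_number": 12,
         "verified_at": datetime(2024, 5, 20)},
        {"document_path": "docs/safety.md", "line_number": 40,
         "verified_at": None},
    ]
    for row in find_stale_usages(rows, now=datetime(2024, 6, 1)):
        print(f"{row['document_path']}:{row['line_number']} needs review")
```

Passing the current time in as an argument, rather than calling datetime.now() inside the function, keeps the check deterministic and easy to test.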

API Integration for Real-Time Validation

Build terminology validation into your content pipeline with a simple API:

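import os
import re

import psycopg2
import psycopg2.extras
from flask import Flask, request, jsonify

app = Flask(__name__)

# Connection string is read from the environment (variable name is an assumption)
DATABASE_URL = os.environ.get('DATABASE_URL', '')


@app.route('/validate-terminology', methods=['POST'])
def validate_terminology():
    payload = request.get_json(silent=True) or {}
    content = payload.get('content')
    if not isinstance(content, str):
        return jsonify({'error': 'content must be a string'}), 400

    source_lang = payload.get('source_language', 'en')
    target_lang = payload.get('target_language')  # optional
    context = payload.get('context', 'documentation')

    issues = []

    # Check for excluded terms
    excluded_terms = get_excluded_terms(source_lang, target_lang, context)
    for term_data in excluded_terms:
        pattern = r'\b' + re.escape(term_data['excluded_term']) + r'\b'
        for match in re.finditer(pattern, content, re.IGNORECASE):
            issues.append({
                'type': 'excluded_term',
                'term': match.group(),
                'position': match.start(),
                'suggested': term_data['approved_term'],
                'reason': term_data['exclusion_reason']
            })

    # Check for missing standardized terms
    content_lower = content.lower()
    standardized_terms = get_standardized_terms(source_lang, context)
    for term_data in standardized_terms:
        # Flag content that mentions a concept without its standardized term
        if term_data['concept'] in content_lower:
            if term_data['standard_term'] not in content:
                issues.append({
                    'type': 'missing_standard_term',
                    'concept': term_data['concept'],
                    'required_term': term_data['standard_term'],
                    'standard': term_data['source_standard']
                })

    return jsonify({
        'valid': len(issues) == 0,
        'issues': issues
    })


def fetch_rows(query, params):
    """Run a query and return the rows as dicts keyed by column name."""
    conn = psycopg2.connect(DATABASE_URL)
    try:
        with conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(query, params)
            return cur.fetchall()
    finally:
        conn.close()


def get_excluded_terms(source_lang, target_lang, context):
    # When no target language is supplied, the language filter is skipped
    query = """
        SELECT target_term AS approved_term,
               unnest(excluded_terms) AS excluded_term,
               'Use approved term instead' AS exclusion_reason
        FROM terminology
        WHERE source_language = %s
        AND (%s::text IS NULL OR target_language = %s)
        AND context = %s
        AND status = 'approved'
        AND excluded_terms IS NOT NULL
    """
    return fetch_rows(query, (source_lang, target_lang, target_lang, context))


def get_standardized_terms(source_lang, context):
    # Assumption: the schema has no dedicated concept column, so this sketch
    # treats the lower-cased source_term as the concept and target_term as the
    # required term for rows tied to a standard. Adapt to your own glossary.
    query = """
        SELECT lower(source_term) AS concept,
               target_term AS standard_term,
               source_standard
        FROM terminology
        WHERE source_language = %s
        AND context = %s
        AND status = 'approved'
        AND source_standard IS NOT NULL
    """
    return fetch_rows(query, (source_lang, context))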
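The excluded-term check is the core of the endpoint, and it can be exercised without Flask or a database. The sketch below isolates that matching step; find_excluded_terms is a hypothetical helper name, and the rules list stands in for rows that would normally come from the terminology table.

```python
import re


def find_excluded_terms(content, rules):
    """Scan content for excluded variants and suggest the approved term.

    `rules` is a list of dicts with 'excluded_term' and 'approved_term',
    the same keys the endpoint reads from each database row.
    """
    issues = []
    for rule in rules:
        # \b anchors the match on word boundaries; re.escape keeps any
        # punctuation in the term from being read as regex syntax
        pattern = r"\b" + re.escape(rule["excluded_term"]) + r"\b"
        for match in re.finditer(pattern, content, re.IGNORECASE):
            issues.append({
                "type": "excluded_term",
                "term": match.group(),
                "position": match.start(),
                "suggested": rule["approved_term"],
            })
    return issues


if __name__ == "__main__":
    rules = [{"excluded_term": "drive unit", "approved_term": "motor"}]
    text = "Disconnect the Drive Unit before servicing the motor."
    for issue in find_excluded_terms(text, rules):
        print(issue)
```

One consequence of the word-boundary pattern is that inflected forms such as "drive units" are not matched, so plural or hyphenated variants need their own entries in excluded_terms.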

Git Hooks for Automated Terminology Checks

Integrate terminology validation into your development workflow:

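#!/bin/bash
# .git/hooks/pre-commit

status=0

# Check staged documentation files for terminology issues
# (read line by line so file names containing spaces survive)
while IFS= read -r file; do
    echo "Checking terminology in $file..."

    # Build the JSON body with jq so quotes and newlines are escaped,
    # using the staged version of the file, then call your terminology API
    result=$(git show ":$file" \
        | jq -Rs '{content: ., context: "documentation"}' \
        | curl -s -X POST \
            -H "Content-Type: application/json" \
            --data-binary @- \
            http://localhost:5000/validate-terminology)

    valid=$(echo "$result" | jq -r '.valid')

    if [ "$valid" = "false" ]; then
        echo "Terminology issues found in $file:"
        # missing_standard_term issues carry .concept and .required_term
        # instead of .term and .reason
        echo "$result" | jq -r '.issues[]? | "- \(.type): \(.term // .concept) (\(.reason // .required_term))"'
        status=1
    fi
done < <(git diff --cached --name-only --diff-filter=ACM | grep -E '\.(md|rst|txt)$')

# Report every offending file before blocking the commit
if [ "$status" -ne 0 ]; then
    exit 1
fi

echo "Terminology validation passed"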
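Whatever language the hook is written in, the request body should come from a JSON encoder rather than from pasting file contents into a string, since documentation routinely contains quotes, backslashes, and newlines. For a hook written in Python, the file filter and payload builder might be sketched as follows; DOC_EXTENSIONS, is_doc_file, and build_payload are hypothetical names.

```python
import json
from pathlib import PurePosixPath

# Same extensions the shell hook greps for
DOC_EXTENSIONS = {".md", ".rst", ".txt"}


def is_doc_file(path):
    """True when a path reported by git looks like a documentation file."""
    return PurePosixPath(path).suffix in DOC_EXTENSIONS


def build_payload(text, context="documentation"):
    """Serialize the request body; json.dumps handles all escaping."""
    return json.dumps({"content": text, "context": context})


if __name__ == "__main__":
    staged = ["docs/guide.md", "src/app.py", "notes.txt"]
    print([p for p in staged if is_doc_file(p)])
    print(build_payload('Set "mode" to safe.\nThen restart.'))
```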

Automated Terminology Extraction

Extract terms from existing documentation to populate your database:

import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

def extract_technical_terms(text, domain="general"):
    # `domain` is currently unused; it is kept as a hook for domain-specific rules
    doc = nlp(text)

    # Extract noun phrases that might be technical terms
    noun_phrases = [chunk.text for chunk in doc.noun_chunks]

    # Filter for technical-sounding terms
    technical_terms = []
    for phrase in noun_phrases:
        # Simple heuristics - you can improve these:
        # short phrases, containing a capital letter, that are not bare determiners
        if (len(phrase.split()) <= 3 and
            any(char.isupper() for char in phrase) and
            phrase.lower() not in ['the', 'a', 'an', 'this', 'that']):
            technical_terms.append(phrase.lower().strip())

    # Counter maps each lower-cased term to its number of occurrences
    return Counter(technical_terms)

def suggest_terminology_candidates(file_paths):
    all_terms = Counter()

    for file_path in file_paths:
        # Read as UTF-8 explicitly so results do not depend on the platform default
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()
        terms = extract_technical_terms(content)
        all_terms.update(terms)

    # Return terms that appear frequently enough to be significant
    candidates = {term: count for term, count in all_terms.items()
                  if count >= 3 and len(term.split()) <= 2}

    return candidates

Integration with Documentation Generators

For teams using tools like GitBook, Notion, or custom static site generators, you can inject terminology validation into the build process:

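// For GitBook plugin (assumes a Node.js runtime with a global fetch, i.e. Node 18+)
module.exports = {
    hooks: {
        "page:before": async function(page) {
            try {
                const response = await fetch('http://localhost:5000/validate-terminology', {
                    method: 'POST',
                    headers: { 'Content-Type': 'application/json' },
                    body: JSON.stringify({
                        content: page.content,
                        context: 'documentation',
                        source_language: 'en'
                    })
                });

                const validation = await response.json();

                if (!validation.valid) {
                    console.warn(`Terminology issues in ${page.path}:`);
                    (validation.issues || []).forEach(issue => {
                        // missing_standard_term issues carry `concept` instead of `term`
                        console.warn(`  - ${issue.type}: ${issue.term || issue.concept}`);
                    });
                }
            } catch (err) {
                // Do not break the build when the validation service is unreachable
                console.warn(`Terminology check skipped for ${page.path}: ${err.message}`);
            }

            return page;
        }
    }
};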
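For a custom static site generator, the same check can run as a plain build step that walks the documentation tree. The sketch below is one possible shape for that step, not a plugin for any particular tool: validate_docs_tree is a hypothetical name, and the validate argument is any callable that takes a file's text and returns a list of issues (in practice, a function that posts to /validate-terminology).

```python
import os


def validate_docs_tree(root, validate, extensions=(".md", ".rst", ".txt")):
    """Run `validate` on every documentation file under `root`.

    Returns {relative_path: issues} for files with at least one issue.
    """
    report = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            # str.endswith accepts a tuple of suffixes
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                issues = validate(handle.read())
            if issues:
                report[os.path.relpath(path, root)] = issues
    return report


if __name__ == "__main__":
    # Stand-in validator so the script runs without the API
    def fake_validate(text):
        return [{"type": "excluded_term", "term": "drive unit"}] if "drive unit" in text else []

    for path, issues in validate_docs_tree(".", fake_validate).items():
        print(path, len(issues))
```

Injecting the validator keeps the traversal testable offline and lets the build decide whether issues should only warn or fail the run.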

Monitoring Terminology Drift

Set up monitoring to catch when new terminology appears without approval:

def detect_terminology_drift(new_content, approved_glossary):
    # Extract potential technical terms from new content
    new_terms = extract_technical_terms(new_content)

    # Extracted terms are lower-cased, so normalize the glossary the same way
    approved = {term.lower() for term in approved_glossary}

    # Check against approved glossary
    unknown_terms = []
    for term, frequency in new_terms.items():
        if term not in approved and frequency >= 2:
            unknown_terms.append({
                'term': term,
                'frequency': frequency,
                'requires_review': True
            })

    return unknown_terms
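To try the drift check without loading a spaCy model, the comparison step can be isolated so that it takes precomputed term counts. In the sketch below, find_unapproved_terms is a hypothetical helper; lower-casing the glossary is an assumption made because extract_technical_terms returns lower-cased terms.

```python
from collections import Counter


def find_unapproved_terms(term_counts, approved_glossary, min_frequency=2):
    """Return frequent terms that are missing from the approved glossary.

    `term_counts` is a Counter of lower-cased terms, as produced by the
    extraction step; results are ordered from most to least frequent.
    """
    approved = {term.lower() for term in approved_glossary}
    return [
        {"term": term, "frequency": count, "requires_review": True}
        for term, count in term_counts.most_common()
        if term not in approved and count >= min_frequency
    ]


if __name__ == "__main__":
    counts = Counter({"motor": 5, "drive unit": 3, "torque sensor": 1})
    for entry in find_unapproved_terms(counts, {"Motor"}):
        print(entry)
```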

Building Your Terminology Workflow

Start small with these components:

Database setup: Use the schema above in PostgreSQL or adapt it for your preferred database
API service: Deploy the Flask app for real-time validation
Git integration: Add the pre-commit hook to catch issues early
Documentation build: Integrate validation into your docs pipeline

With this infrastructure in place, terminology consistency is enforced by tooling at commit and build time rather than depending on human memory and manual review.

The industrial translation world has learned that terminology management isn't optional for mission-critical documentation. The same principle applies to developer documentation, API specs, and user-facing content.

For more insights on professional terminology management in technical translation, check out this detailed guide on industrial documentation translation.
