How to Convert Journal Articles to JATS XML with Python

#python #automation #tutorial

How to Convert Journal Articles to JATS XML with Python

If you've ever spent hours manually converting Word documents to XML for Open Journal Systems, you know how tedious this process can be. The repetitive tasks of copy-pasting, formatting, and validating XML entries not only eat up time but also open the door to human error. For developers working in academic publishing workflows, automating this task is essential.

The Manual Way (And Why It Breaks)

Most developers working in academic publishing either do it by hand or rely on a combination of manual steps. They open each .docx file, copy content into a template, and painstakingly format it for JATS XML. Some use spreadsheets to track metadata, while others hit API limits trying to upload files one by one. Many run into issues with equation rendering, incorrect citation formats, or missing bibliographies—problems that compound when handling dozens of articles. This method isn’t just slow; it’s unreliable, and it doesn’t scale.

The Python Approach

You could start by writing a script that processes .docx files and extracts content using python-docx. Here's a minimal version that focuses on extracting paragraphs and handling some basic formatting:

import docx
import os
from pathlib import Path

def convert_docx_to_xml(input_path, output_path):
    doc = docx.Document(input_path)
    xml_content = []

    for para in doc.paragraphs:
        style = para.style.name
        if style == 'Heading 1':
            xml_content.append(f"<title>{para.text}</title>")
        elif style == 'Heading 2':
            xml_content.append(f"<subtitle>{para.text}</subtitle>")
        else:
            xml_content.append(f"<p>{para.text}</p>")

    # Write to file
    with open(output_path, 'w') as f:
        f.write('\n'.join(xml_content))

# Process batch
input_folder = "articles/"
output_folder = "converted/"
Path(output_folder).mkdir(exist_ok=True)

for file in os.listdir(input_folder):
    if file.endswith(".docx"):
        input_path = os.path.join(input_folder, file)
        output_path = os.path.join(output_folder, file.replace(".docx", ".xml"))
        convert_docx_to_xml(input_path, output_path)

This script reads paragraphs from a Word document and converts them to basic XML tags. It handles headings and paragraphs but lacks support for math equations, citations, images, or complex metadata. It also doesn’t generate full JATS XML compliant with NLM standards. At scale, such a solution breaks down.

What the Full Tool Handles

Handles complex formatting including tables, figures, and mathematical equations
Preserves bibliographic references and citation styles
Supports batch processing with a clean CLI interface
Produces multiple output formats: PDF, HTML, and JATS XML
Includes error handling and validation to prevent malformed XML
Works with locally stored Word documents without requiring cloud uploads

Running It

To use the Journal Article Converter, you’ll run a simple command in your terminal:

python journal_converter.py --input articles_folder/ --output converted_articles/

This command processes all .docx files in the input directory and outputs PDF, HTML, and JATS XML files in the output folder. The tool supports a range of optional flags for customization but defaults to standard academic publishing formats.

Results

With this tool, you can process dozens of journal articles in minutes, not hours. It generates properly formatted XML that’s ready for import into Open Journal Systems, and produces clean HTML and PDF versions for review. The time saved is significant, especially when dealing with regular submissions.

Get the Script

Skip the build and get a polished, production-ready solution. The Journal Article Converter is a one-time purchase at $29. No subscriptions, no hidden fees.

Download Journal Article Converter →

$29 one-time. No subscription. Works on Windows, Mac, and Linux.

Built by OddShop — Python automation tools for developers and businesses.