DEV Community

Oddshop

Posted on • Originally published at oddshop.work

How to Convert Journal Articles from Word to JATS XML with Python

If you've ever spent hours manually converting Word documents to XML for scholarly journals, you know the frustration. The process is tedious, error-prone, and eats up valuable development time. As academic publishers grow, the need to automate this workflow becomes more urgent — especially when handling dozens or hundreds of articles at once.

The Manual Way (And Why It Breaks)

Most developers who work with academic publishing tools end up doing this by hand. They open each .docx file, copy-paste content into a templating system, and painstakingly format equations, citations, and figures to match journal requirements. Some even build spreadsheets to track metadata or use third-party tools that only support one output format at a time. The result is a slow, inconsistent pipeline — and constant risk of human error.

Worse still, when working with Open Journal Systems (OJS), you're often required to export to JATS XML, a structured format standardized as ANSI/NISO Z39.96 that grew out of the earlier NLM journal DTDs. Each article has to be processed individually, and formatting problems can break the import. The manual method isn't just slow; it's unsustainable.

The Python Approach

Here's a basic Python script that demonstrates how to extract content from a Word document and begin converting it to XML. It relies on two third-party packages: python-docx (imported as docx) and lxml. This is not production-ready but shows how the core logic works:

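import docx  # provided by the python-docx package
from lxml import etree

def convert_docx_to_xml(docx_path, xml_path="output.xml"):
    """Copy each non-empty paragraph of a .docx file into a bare <article><body>."""
    doc = docx.Document(docx_path)
    root = etree.Element("article")
    body = etree.SubElement(root, "body")

    # doc.paragraphs yields body-level paragraphs only; table cells are not included
    for para in doc.paragraphs:
        if not para.text.strip():
            continue  # skip blank spacer paragraphs
        p = etree.SubElement(body, "p")
        p.text = para.text

    # Save as XML with an explicit declaration
    tree = etree.ElementTree(root)
    tree.write(xml_path, pretty_print=True, xml_declaration=True, encoding="utf-8")

if __name__ == "__main__":
    convert_docx_to_xml("article.docx")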

This code reads a Word file and extracts paragraph text, wrapping it in basic XML tags. It gives you a starting point, but the output is only a bare article/body skeleton. It doesn't handle tables, figures, equations, or structured metadata, all of which are necessary for compliant JATS XML. It also processes one file at a time and produces no PDF or HTML output, which makes it impractical for real-world use.
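To give a rough idea of what the next step might look like, here is a minimal sketch that adds a front section with a title and groups paragraphs under sec elements whenever a heading appears. It is an illustration, not the converter's actual code: the function name build_jats and the (style name, text) input shape are assumptions made for this example, and it uses the standard library's xml.etree.ElementTree so it runs without extra packages. The result is JATS-shaped, not validated JATS.

```python
import xml.etree.ElementTree as ET

def build_jats(title, paragraphs):
    """Build a minimal JATS-shaped tree from (style_name, text) pairs.

    Hypothetical helper for illustration; real JATS needs far more metadata.
    """
    article = ET.Element("article", {"article-type": "research-article"})

    # <front> carries metadata; only the article title is filled in here.
    front = ET.SubElement(article, "front")
    meta = ET.SubElement(front, "article-meta")
    title_group = ET.SubElement(meta, "title-group")
    ET.SubElement(title_group, "article-title").text = title

    body = ET.SubElement(article, "body")
    container = body  # paragraphs before the first heading go straight into <body>
    for style_name, text in paragraphs:
        if not text.strip():
            continue  # skip blank spacer paragraphs
        if style_name.startswith("Heading"):
            # Each heading opens a new flat <sec>; nesting by heading level is omitted.
            container = ET.SubElement(body, "sec")
            ET.SubElement(container, "title").text = text
        else:
            ET.SubElement(container, "p").text = text
    return article
```

With python-docx, the pairs could come from something like [(p.style.name, p.text) for p in doc.paragraphs], assuming the manuscript uses Word's built-in Heading styles. Tables, figures, references, and equations would still need their own handling.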

What the Full Tool Handles

The Journal Article Converter goes well beyond what simple code snippets can do. It:

  • Handles complex Word document structures including nested tables, footnotes, and embedded images.
  • Automatically preserves mathematical equations and renders them as MathML in XML.
  • Supports batch processing of multiple articles in a single command.
  • Generates output in multiple formats: PDF, HTML, and JATS XML.
  • Includes proper error handling and logging for failed conversions.
  • Works with both .docx and .doc files.
  • Provides a clean CLI interface for integration into larger systems.
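Batch processing with per-file error handling, as listed above, can be sketched in a few lines. This is not the tool's implementation: convert_folder and the convert_one callback are hypothetical names chosen for this example, and the sketch only writes one .xml per input file. The idea is that one broken manuscript gets logged and skipped instead of stopping the whole run.

```python
import logging
from pathlib import Path

logger = logging.getLogger("batch_convert")

def convert_folder(input_dir, output_dir, convert_one):
    """Run convert_one(src, dest) on every .docx in input_dir.

    convert_one is any callable you supply (hypothetical here).
    Returns (succeeded, failed) lists of file names.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    succeeded, failed = [], []

    for src in sorted(input_dir.glob("*.docx")):
        if src.name.startswith("~$"):
            continue  # skip the lock files Word creates while a document is open
        dest = output_dir / (src.stem + ".xml")
        try:
            convert_one(src, dest)
        except Exception:
            # One bad file should not stop the batch; log the traceback and move on.
            logger.exception("Failed to convert %s", src.name)
            failed.append(src.name)
        else:
            logger.info("Converted %s -> %s", src.name, dest.name)
            succeeded.append(src.name)
    return succeeded, failed
```

Passing the converter in as a callable keeps the loop independent of any one output format, so the same loop could drive a PDF or HTML step as well.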

Running It

You can run the converter from the command line like this:

python journal_converter.py --input articles_folder/ --output converted_articles/
# Converts all Word docs in input folder to PDF, HTML, and JATS XML

It takes an input directory with .docx files and generates a corresponding set of outputs — one folder per article, with all formats generated in a single step. The tool logs which files succeeded or failed, so you can track progress easily.
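If you are curious how a command line like this could be wired up, here is a small sketch using argparse. Only the --input and --output flags come from the command shown above; the function names, the file naming scheme, and the list of extensions are assumptions for illustration and may not match what the actual tool does.

```python
import argparse
from pathlib import Path

def parse_args(argv=None):
    """Parse the two flags shown in the command above."""
    parser = argparse.ArgumentParser(description="Batch-convert Word articles.")
    parser.add_argument("--input", required=True, type=Path,
                        help="folder containing .docx files")
    parser.add_argument("--output", required=True, type=Path,
                        help="folder to write results into")
    return parser.parse_args(argv)

def plan_outputs(input_dir, output_dir, formats=("pdf", "html", "xml")):
    """Map each .docx to the files it would produce, one folder per article.

    The naming scheme here is hypothetical.
    """
    plan = {}
    for src in sorted(input_dir.glob("*.docx")):
        article_dir = output_dir / src.stem
        plan[src] = [article_dir / f"{src.stem}.{ext}" for ext in formats]
    return plan
```

Calling parse_args() with no arguments reads sys.argv, and plan_outputs(args.input, args.output) then tells you which files to create before any conversion work starts.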

Results

With this tool, you can convert dozens of documents in minutes instead of hours. You get structured, compliant JATS XML files ready for import into OJS, along with PDFs and HTML versions for review. It solves the biggest pain point in many academic workflows — the manual labor and time spent on formatting.
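Before importing any generated file, a quick sanity check can catch the most obvious problems. The sketch below is an assumption-laden example, not a JATS validator: quick_check and the EXPECTED_PATHS tuple are made up for this post, and passing these checks does not mean a file is valid against the JATS DTD or schema.

```python
import xml.etree.ElementTree as ET

# Elements this sketch expects to find; chosen for illustration only.
EXPECTED_PATHS = ("front/article-meta/title-group/article-title", "body")

def quick_check(xml_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed: {exc}"]

    problems = []
    if root.tag != "article":
        problems.append(f"root element is <{root.tag}>, expected <article>")
    for path in EXPECTED_PATHS:
        if root.find(path) is None:
            problems.append(f"missing {path}")
    return problems
```

For real validation you would check the file against the official JATS DTD or schema with a validating parser.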

Get the Script

If you don't want to build this yourself, the Journal Article Converter saves you the time and effort. It’s a polished, tested solution that handles all the edge cases you’ll encounter in real-world publishing workflows.

Download Journal Article Converter →

$29 one-time. No subscription. Works on Windows, Mac, and Linux.

Built by OddShop — Python automation tools for developers and businesses.
