Solved: Migrating Confluence Pages to Markdown for Hugo/Jekyll Blog

#devops #programming #tutorial #cloud

🚀 Executive Summary

TL;DR: This article addresses the challenge of Confluence vendor lock-in by providing a comprehensive Python script to automate the migration of Confluence pages to Markdown. It enables DevOps Engineers and System Administrators to easily publish their documentation to modern static site generators like Hugo or Jekyll, eliminating tedious manual reformatting.

🎯 Key Takeaways

Confluence content can be programmatically accessed using a personal API token, which should be securely managed via environment variables to prevent hardcoding.
The Python script leverages the requests library for Confluence REST API interaction and html2text for converting fetched HTML page bodies into Markdown format suitable for static site generators.
The migration process includes handling API pagination, slugifying page titles for clean URLs, and generating YAML front matter with essential metadata like title, dates, and categories for Hugo/Jekyll.
Common pitfalls include Confluence API rate limiting, imperfect conversion of complex HTML (e.g., embedded macros), and the script’s focus on text content without automatically downloading or re-linking embedded images and attachments.

Migrating Confluence Pages to Markdown for Hugo/Jekyll Blog

As DevOps Engineers and System Administrators, we often find ourselves wrestling with documentation challenges. Confluence is a powerful collaboration tool, widely used for team knowledge bases, project documentation, and technical articles. However, its proprietary format can quickly become a vendor lock-in dilemma. What if you want to publish your carefully crafted Confluence pages to a modern static site generator like Hugo or Jekyll, perhaps for a public-facing blog or a more lightweight internal knowledge base?

Manually copying and pasting content, then reformatting it into Markdown, is not only tedious but also prone to errors, especially for large volumes of pages. This article provides a comprehensive, step-by-step technical tutorial on how to automate the migration of your Confluence pages to Markdown, making them ready for publication on your Hugo or Jekyll blog.

Unlock your content, break free from proprietary formats, and embrace the versatility of Markdown and static site generators. Let’s get started.

Prerequisites

Before we dive into the migration process, ensure you have the following in place:

Confluence Cloud Instance: Access to a Confluence Cloud instance with sufficient permissions to view the pages you intend to migrate. You will need to create an API token.
Confluence API Token: A personal API token for authentication with the Confluence REST API. We will walk through how to generate this.
Python 3.x: Installed on your local machine.
pip: Python’s package installer, usually bundled with Python 3.x.
Basic Understanding: Familiarity with Python scripting, command-line interfaces, and the concept of static site generators (Hugo/Jekyll) will be beneficial.

Step-by-Step Guide

Step 1: Generate a Confluence API Token

To programmatically access your Confluence content, you need an API token. This acts as a secure password for API requests, tied to your Atlassian account.

Log in to your Atlassian account (id.atlassian.com).
Navigate to Security > Create and manage API tokens.
Click Create API token.
Give your token a descriptive Label (e.g., “Confluence Migrator”).
Copy the generated token immediately. It will not be shown again.

Security Note: Treat your API token like a password. Do not hardcode it directly into scripts for production use; instead, use environment variables or a secure configuration management system. For this tutorial, we’ll use environment variables for demonstration.

Step 2: Set Up Your Python Environment

It’s good practice to work within a Python virtual environment to manage dependencies.

First, create a project directory and a virtual environment:

mkdir confluence-migrator
cd confluence-migrator
python3 -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate

Next, install the necessary Python libraries. We’ll use requests for making HTTP calls to the Confluence API and html2text for converting the fetched HTML content into Markdown.

pip install requests html2text

Step 3: Write the Python Migration Script

Now, let’s craft the Python script that will fetch your Confluence pages, convert them, and save them as Markdown files. Create a file named migrate.py in your project directory.

3.1. Configure Authentication and API Endpoints

We’ll store sensitive information in environment variables. Set these in your shell before running the script (or add them to a .env file and use python-dotenv).

export CONFLUENCE_URL="https://your-domain.atlassian.net/wiki"
export CONFLUENCE_EMAIL="your-atlassian-email@example.com"
export CONFLUENCE_API_TOKEN="YOUR_API_TOKEN_HERE"
export CONFLUENCE_SPACE_KEYS="SPACEKEY1,SPACEKEY2" # Comma-separated list of space keys

Your migrate.py script will read these variables.

import os
import requests
import html2text
import re
from datetime import datetime

# --- Configuration ---
CONFLUENCE_URL = os.getenv("CONFLUENCE_URL")
CONFLUENCE_EMAIL = os.getenv("CONFLUENCE_EMAIL")
CONFLUENCE_API_TOKEN = os.getenv("CONFLUENCE_API_TOKEN")
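# Split the comma-separated list, tolerating spaces and empty entries (e.g. "KEY1, KEY2,")
CONFLUENCE_SPACE_KEYS = [key.strip() for key in os.getenv("CONFLUENCE_SPACE_KEYS", "").split(',') if key.strip()]

if not all([CONFLUENCE_URL, CONFLUENCE_EMAIL, CONFLUENCE_API_TOKEN]):
    # SystemExit with a message prints it to stderr and exits with status 1
    raise SystemExit("Error: CONFLUENCE_URL, CONFLUENCE_EMAIL, or CONFLUENCE_API_TOKEN not set.")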

HEADERS = {
    "Accept": "application/json"
}
AUTH = (CONFLUENCE_EMAIL, CONFLUENCE_API_TOKEN)
OUTPUT_DIR = "markdown_output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# --- Helper Functions ---
def slugify(text):
    text = re.sub(r'[^a-z0-9\s-]', '', text.lower())
    text = re.sub(r'[\s-]+', '-', text).strip('-')
    return text

def get_confluence_pages(space_key):
    print(f"Fetching pages for space: {space_key}")
    pages = []
    start = 0
    limit = 25  # Page size per request; Confluence may cap this when body content is expanded
    while True:
        url = f"{CONFLUENCE_URL}/rest/api/content?spaceKey={space_key}&expand=body.view,version&start={start}&limit={limit}"
        response = requests.get(url, headers=HEADERS, auth=AUTH)
        response.raise_for_status() # Raises HTTPError for bad responses (4xx or 5xx)
        data = response.json()
        pages.extend(data['results'])

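        # Stop when there is no next page, or when nothing came back (avoids looping forever)
        if 'next' not in data.get('_links', {}) or not data['results']:
            break
        start += len(data['results'])  # Advance by what was actually returned, which may be fewer than `limit`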
        print(f"  Fetched {len(pages)} pages. Continuing for more...")
    return pages

def get_page_content(page_id):
    url = f"{CONFLUENCE_URL}/rest/api/content/{page_id}?expand=body.view,version"
    response = requests.get(url, headers=HEADERS, auth=AUTH)
    response.raise_for_status()
    return response.json()

def convert_html_to_markdown(html_content):
    h = html2text.HTML2Text()
    h.ignore_images = False # Set to True if you don't want image links
    h.images_as_html = True # Emit images as raw HTML img tags, preserving attributes such as width and height
    h.body_width = 0        # Don't wrap lines
    markdown_content = h.handle(html_content)
    return markdown_content

def generate_front_matter(title, creation_date, update_date, slug, space_key):
    # Adjust for Hugo or Jekyll requirements
    # For Hugo:
    # ---
    # title: "My Confluence Page Title"
    # date: 2023-10-27T10:00:00Z
    # lastmod: 2023-10-27T14:30:00Z
    # draft: false
    # tags: ["confluence", "migration", "devops"]
    # categories: ["documentation", "tech"]
    # ---
    #
    # For Jekyll:
    # ---
    # layout: post
    # title: "My Confluence Page Title"
    # date: 2023-10-27 10:00:00 +0000
    # categories: [documentation, tech]
    # tags: [confluence, migration, devops]
    # ---

    # Dates often come as "2023-10-27T14:30:00.000Z" from Confluence.
    # Drop fractional seconds and any trailing "Z" before parsing, then re-append "Z".
    created = datetime.strptime(creation_date.split('.')[0].rstrip('Z'), "%Y-%m-%dT%H:%M:%S").isoformat() + "Z"
    updated = datetime.strptime(update_date.split('.')[0].rstrip('Z'), "%Y-%m-%dT%H:%M:%S").isoformat() + "Z"

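    # Escape backslashes and double quotes so the title stays a valid YAML string.
    # This is done outside the f-string because a backslash inside an f-string
    # expression is a SyntaxError before Python 3.12.
    safe_title = title.replace('\\', '\\\\').replace('"', '\\"')

    front_matter = f"""---
title: "{safe_title}"
date: {created}
lastmod: {updated}
draft: false
categories: ["{space_key.lower()}"]
tags: ["confluence", "migration"]
---

"""
    return front_matter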

# --- Main Logic ---
def main():
    for space_key in CONFLUENCE_SPACE_KEYS:
        if not space_key:
            continue
        print(f"Processing space: {space_key}")
        pages_in_space = get_confluence_pages(space_key)

        for page_summary in pages_in_space:
            page_id = page_summary['id']
            page_title = page_summary['title']
            page_type = page_summary['type'] # Usually 'page' or 'blogpost'

            if page_type != 'page': # We might want to filter out blog posts or other content types
                print(f"  Skipping {page_type}: {page_title}")
                continue

            print(f"  Processing page: {page_title} (ID: {page_id})")

            try:
                full_page_data = get_page_content(page_id)
                html_content = full_page_data['body']['view']['value']

                # Dates: version['when'] is the last-modified timestamp. The v1 API does not
                # directly expose the original creation date with this expand; to recover it,
                # you would query /rest/api/content/{id}/history for the first version.
                # For simplicity, 'date' is set to the migration time (UTC) and 'lastmod'
                # to the Confluence last-modified timestamp.
                # Note: utcnow() is deprecated as of Python 3.12 but still works;
                # generate_front_matter appends the trailing "Z".
                current_time_iso = datetime.utcnow().isoformat(timespec='seconds')

                markdown_content = convert_html_to_markdown(html_content)

                slug = slugify(page_title)
                filename = os.path.join(OUTPUT_DIR, f"{slug}.md")

                front_matter = generate_front_matter(
                    page_title,
                    current_time_iso,                 # 'date': time of migration
                    page_summary['version']['when'],  # 'lastmod': last modified in Confluence
                    slug,
                    space_key
                )

                with open(filename, "w", encoding="utf-8") as f:
                    f.write(front_matter)
                    f.write(markdown_content)
                print(f"  Saved '{page_title}' to {filename}")

            except requests.exceptions.RequestException as e:
                print(f"  Error fetching page {page_id} ({page_title}): {e}")
            except Exception as e:
                print(f"  An unexpected error occurred for page {page_id} ({page_title}): {e}")

if __name__ == "__main__":
    main()

Logic Explanation:

The script starts by loading your Confluence URL, email, API token, and desired space keys from environment variables for security.
get_confluence_pages paginates through all pages within a specified Confluence space using the v1 REST API endpoint /rest/api/content, expanding body.view to get the rendered HTML content and version for modification dates.
get_page_content fetches the full content of a specific page ID.
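convert_html_to_markdown uses the html2text library to transform the fetched HTML into Markdown. We configure it to retain images as HTML img tags, which preserves attributes such as width and height. Note that Hugo's default Markdown renderer (Goldmark) omits raw HTML unless markup.goldmark.renderer.unsafe is enabled in the site configuration.
generate_front_matter creates the YAML front matter expected by Hugo or Jekyll: the title, a date field set to the time of migration, a lastmod field taken from the page's last-modified timestamp in Confluence, a category derived from the Confluence space key, and a fixed pair of tags.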
The main function iterates through each specified Confluence space, fetches its pages, converts them, and saves them to individual .md files in the markdown_output directory.
The slugify function ensures filenames are clean and URL-friendly.
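
One consequence of the slug-based filenames is worth a sketch: two pages whose titles slugify to the same string would write to the same file, and a title with no ASCII letters or digits slugifies to an empty string. The helper below is a minimal illustration of one way to guard against both. The name unique_slug and the page-ID fallback are assumptions for this example, not part of the script above.

```python
import re

def slugify(text):
    # Same rules as the tutorial's helper: lowercase, drop punctuation, hyphenate.
    text = re.sub(r'[^a-z0-9\s-]', '', text.lower())
    return re.sub(r'[\s-]+', '-', text).strip('-')

def unique_slug(title, page_id, seen):
    """Return a slug not yet in `seen`, falling back to the page ID when needed.

    Hypothetical helper: `seen` is a set of slugs already used in this run.
    """
    # A title with no ASCII letters or digits slugifies to "", so fall back to the ID.
    slug = slugify(title) or f"page-{page_id}"
    if slug in seen:
        # Disambiguate duplicates (for example, same title in two spaces).
        slug = f"{slug}-{page_id}"
    seen.add(slug)
    return slug

seen = set()
print(unique_slug("Release Notes (v2.0)!", "101", seen))  # release-notes-v20
print(unique_slug("Release Notes v2.0", "102", seen))     # release-notes-v20-102
print(unique_slug("日本語", "103", seen))                  # page-103
```

In main, this would replace the direct slugify(page_title) call, with a single seen set shared across all spaces.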

3.2. Run the Script

With your environment variables set and the script ready, execute it from your terminal:

python3 migrate.py

You should see output indicating pages being processed, and a new markdown_output directory will be populated with your Confluence content in Markdown format.

Step 4: Integrate with Hugo/Jekyll

The final step is to incorporate the generated Markdown files into your static site generator project.

Hugo: Copy the .md files from markdown_output into your Hugo project’s content/posts directory (or any other content section you prefer).
Jekyll: Place the .md files into your Jekyll project’s _posts directory. Jekyll requires post filenames in the format YYYY-MM-DD-title.md, so you will need to adjust the filename generation in the script to prepend the date.
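
For the Jekyll case, the date prefix can be derived from the same timestamp string that goes into the front matter. The following is a rough sketch under that assumption; jekyll_filename is a hypothetical helper name, and markdown_output matches the OUTPUT_DIR used in the script.

```python
import os
from datetime import datetime

def jekyll_filename(slug, iso_date, output_dir="markdown_output"):
    """Build a YYYY-MM-DD-slug.md path from an ISO 8601 timestamp string."""
    # Keep only the date part, e.g. "2023-10-27" from "2023-10-27T14:30:00.000Z".
    # Parsing it validates the value instead of trusting the slice blindly.
    day = datetime.strptime(iso_date[:10], "%Y-%m-%d").strftime("%Y-%m-%d")
    return os.path.join(output_dir, f"{day}-{slug}.md")

print(jekyll_filename("my-page", "2023-10-27T14:30:00.000Z"))
```

In main, the line that builds filename would call this helper with the slug and whichever timestamp you want Jekyll to treat as the post date.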

After placing the files, run your static site generator’s local server (e.g., hugo server or bundle exec jekyll serve) to preview the migrated content and make any necessary style or formatting adjustments.

Common Pitfalls

API Rate Limiting: Confluence Cloud APIs have rate limits. If you’re migrating a very large number of pages, you might hit these limits, resulting in 429 Too Many Requests errors. Implement a retry mechanism with exponential backoff if this becomes an issue.
Complex HTML Conversion: While html2text handles common markup well, Confluence’s rich editor can generate highly complex HTML, including embedded macros, custom CSS, or specific table structures that might not convert perfectly to Markdown. Manual review and post-conversion cleanup of the Markdown files may be necessary, especially for heavily formatted pages.
Missing Attachments/Images: This script focuses on text content. Embedded images are converted to their <img> tags, but the images themselves are not downloaded. A more advanced script would need to identify image URLs, download them, and update Markdown references to point to local assets.
Authentication Errors: Double-check your CONFLUENCE_URL, CONFLUENCE_EMAIL, and CONFLUENCE_API_TOKEN values. A common mistake is using your regular password instead of an API token, or having typos in the URL.
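
For the rate-limiting pitfall, a retry loop with exponential backoff might look like the sketch below. It is an illustration rather than a drop-in fix: get_with_backoff is a hypothetical helper, the retry count and delays are arbitrary defaults, and it assumes the server may (but need not) send a Retry-After header containing a number of seconds. The request itself is passed in as a callable so the helper stays independent of how the call is made.

```python
import time

def get_with_backoff(fetch, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-429 response or retries run out.

    `fetch` is a zero-argument callable returning a response object with
    `status_code` and `headers` attributes (as requests responses have).
    """
    for attempt in range(max_retries + 1):
        response = fetch()
        # Return on success, on any non-429 error, or once retries are exhausted.
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Prefer the server's hint when it is a plain number of seconds;
        # otherwise double the delay on each attempt: 1s, 2s, 4s, ...
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
        sleep(delay)
```

Inside get_confluence_pages, the call could become response = get_with_backoff(lambda: requests.get(url, headers=HEADERS, auth=AUTH)), followed by the existing response.raise_for_status().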

Conclusion

You’ve successfully automated the often daunting task of migrating Confluence pages to Markdown. This process empowers you to take control of your content, making it portable and future-proof. By leveraging static site generators, you gain benefits like improved performance, enhanced security, simplified hosting, and seamless integration with modern DevOps workflows.

Consider this script a solid foundation. Next steps could involve:

Automating image and attachment migration.
Implementing more sophisticated error handling and logging.
Adding support for Confluence blog posts or other content types.
Creating a continuous integration pipeline to periodically sync changes from Confluence to your static site.
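
As a starting point for the first item, image migration begins with finding the image URLs in each page's HTML. The sketch below does only that step, using Python's standard library. ImageCollector and find_image_urls are hypothetical names, and the attachment path in the sample is illustrative rather than a guaranteed Confluence URL format. Downloading the files (with the same authentication) and rewriting the references are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img> tags seen while parsing."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative or root-relative paths against the site URL.
                self.urls.append(urljoin(self.base_url, src))

def find_image_urls(html, base_url):
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.urls

# Illustrative HTML; real Confluence markup will differ.
sample = '<p>Diagram: <img src="/wiki/download/attachments/123/arch.png" alt="arch"></p>'
print(find_image_urls(sample, "https://your-domain.atlassian.net/wiki"))
```

The returned list could be fed to a download loop, after which the img references in the Markdown would be rewritten to point at your site's static assets directory.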

Embrace the power of automation and keep your documentation agile and accessible!