Executive Summary
TL;DR: This article addresses the challenge of Confluence vendor lock-in by providing a comprehensive Python script to automate the migration of Confluence pages to Markdown. It enables DevOps Engineers and System Administrators to easily publish their documentation to modern static site generators like Hugo or Jekyll, eliminating tedious manual reformatting.
Key Takeaways
- Confluence content can be programmatically accessed using a personal API token, which should be securely managed via environment variables to prevent hardcoding.
- The Python script leverages the requests library for Confluence REST API interaction and html2text for converting fetched HTML page bodies into Markdown format suitable for static site generators.
- The migration process includes handling API pagination, slugifying page titles for clean URLs, and generating YAML front matter with essential metadata like title, dates, and categories for Hugo/Jekyll.
- Common pitfalls include Confluence API rate limiting, imperfect conversion of complex HTML (e.g., embedded macros), and the script's focus on text content without automatically downloading or re-linking embedded images and attachments.
Migrating Confluence Pages to Markdown for Hugo/Jekyll Blog
As DevOps Engineers and System Administrators, we often find ourselves wrestling with documentation challenges. Confluence is a powerful collaboration tool, widely used for team knowledge bases, project documentation, and technical articles. However, its proprietary format can quickly become a vendor lock-in dilemma. What if you want to publish your carefully crafted Confluence pages to a modern static site generator like Hugo or Jekyll, perhaps for a public-facing blog or a more lightweight internal knowledge base?
Manually copying and pasting content, then reformatting it into Markdown, is not only tedious but also prone to errors, especially for large volumes of pages. This article provides a comprehensive, step-by-step technical tutorial on how to automate the migration of your Confluence pages to Markdown, making them ready for publication on your Hugo or Jekyll blog.
Unlock your content, break free from proprietary formats, and embrace the versatility of Markdown and static site generators. Let's get started.
Prerequisites
Before we dive into the migration process, ensure you have the following in place:
- Confluence Cloud Instance: Access to a Confluence Cloud instance with sufficient permissions to view the pages you intend to migrate. You will need to create an API token.
- Confluence API Token: A personal API token for authentication with the Confluence REST API. We will walk through how to generate this.
- Python 3.x: Installed on your local machine.
- pip: Python's package installer, usually bundled with Python 3.x.
- Basic Understanding: Familiarity with Python scripting, command-line interfaces, and the concept of static site generators (Hugo/Jekyll) will be beneficial.
Step-by-Step Guide
Step 1: Generate a Confluence API Token
To programmatically access your Confluence content, you need an API token. This acts as a secure password for API requests, tied to your Atlassian account.
- Log in to your Atlassian account (id.atlassian.com).
- Navigate to Security > Create and manage API tokens.
- Click Create API token.
- Give your token a descriptive Label (e.g., "Confluence Migrator").
- Copy the generated token immediately. It will not be shown again.
Security Note: Treat your API token like a password. Do not hardcode it directly into scripts for production use; instead, use environment variables or a secure configuration management system. For this tutorial, we'll use environment variables for demonstration.
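To make the "secure password" idea concrete: the email and API token are sent as HTTP Basic credentials, which is what passing an (email, token) tuple as auth to requests produces. The sketch below builds the same header by hand; the function name basic_auth_header and the sample values are placeholders for illustration, not part of the migration script.

```python
import base64


def basic_auth_header(email: str, api_token: str) -> dict:
    """Build an HTTP Basic Authorization header from an email and API token."""
    credentials = f"{email}:{api_token}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Placeholder values for illustration only; never commit a real token.
header = basic_auth_header("user@example.com", "example-token")
print(header["Authorization"].split(" ")[0])  # prints the scheme: Basic
```

Because the header is only base64-encoded, not encrypted, anyone who sees it can recover the token, which is one more reason to keep it out of source control and logs.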
Step 2: Set Up Your Python Environment
It's good practice to work within a Python virtual environment to manage dependencies.
First, create a project directory and a virtual environment:
mkdir confluence-migrator
cd confluence-migrator
python3 -m venv venv
source venv/bin/activate # On Windows: .\venv\Scripts\activate
Next, install the necessary Python libraries. We'll use requests for making HTTP calls to the Confluence API and html2text for converting the fetched HTML content into Markdown.
pip install requests html2text
Step 3: Write the Python Migration Script
Now, let's craft the Python script that will fetch your Confluence pages, convert them, and save them as Markdown files. Create a file named migrate.py in your project directory.
3.1. Configure Authentication and API Endpoints
We'll store sensitive information in environment variables. Set these in your shell before running the script (or add them to a .env file and use python-dotenv).
export CONFLUENCE_URL="https://your-domain.atlassian.net/wiki"
export CONFLUENCE_EMAIL="your-atlassian-email@example.com"
export CONFLUENCE_API_TOKEN="YOUR_API_TOKEN_HERE"
export CONFLUENCE_SPACE_KEYS="SPACEKEY1,SPACEKEY2" # Comma-separated list of space keys
Your migrate.py script will read these variables.
import os
import re
import sys
from datetime import datetime

import html2text
import requests

# --- Configuration ---
CONFLUENCE_URL = os.getenv("CONFLUENCE_URL")
CONFLUENCE_EMAIL = os.getenv("CONFLUENCE_EMAIL")
CONFLUENCE_API_TOKEN = os.getenv("CONFLUENCE_API_TOKEN")
# Strip whitespace and drop empty entries from the comma-separated list
CONFLUENCE_SPACE_KEYS = [
    key.strip()
    for key in os.getenv("CONFLUENCE_SPACE_KEYS", "").split(',')
    if key.strip()
]

if not all([CONFLUENCE_URL, CONFLUENCE_EMAIL, CONFLUENCE_API_TOKEN]):
    print("Error: CONFLUENCE_URL, CONFLUENCE_EMAIL, or CONFLUENCE_API_TOKEN not set.", file=sys.stderr)
    sys.exit(1)

HEADERS = {
    "Accept": "application/json"
}
AUTH = (CONFLUENCE_EMAIL, CONFLUENCE_API_TOKEN)
OUTPUT_DIR = "markdown_output"
os.makedirs(OUTPUT_DIR, exist_ok=True)

# --- Helper Functions ---
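def slugify(text):
    # Lowercase, drop anything that is not a letter, digit, space, or hyphen
    text = re.sub(r'[^a-z0-9\s-]', '', text.lower())
    # Collapse runs of spaces/hyphens into a single hyphen
    text = re.sub(r'[\s-]+', '-', text).strip('-')
    return text
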
def get_confluence_pages(space_key):
    print(f"Fetching pages for space: {space_key}")
    pages = []
    start = 0
    limit = 25  # Page size per request (25 is the v1 API default)
    while True:
        url = f"{CONFLUENCE_URL}/rest/api/content?spaceKey={space_key}&expand=body.view,version&start={start}&limit={limit}"
        response = requests.get(url, headers=HEADERS, auth=AUTH)
        response.raise_for_status()  # Raises HTTPError for bad responses (4xx or 5xx)
        data = response.json()
        pages.extend(data['results'])
        # The API includes a 'next' link only while more results remain
        if 'next' not in data['_links']:
            break
        start += limit
        print(f"  Fetched {len(pages)} pages. Continuing for more...")
    return pages

def get_page_content(page_id):
    url = f"{CONFLUENCE_URL}/rest/api/content/{page_id}?expand=body.view,version"
    response = requests.get(url, headers=HEADERS, auth=AUTH)
    response.raise_for_status()
    return response.json()

def convert_html_to_markdown(html_content):
    h = html2text.HTML2Text()
    h.ignore_images = False  # Set to True if you don't want image links
    h.images_as_html = True  # Keep images as HTML img tags (useful for Hugo/Jekyll shortcodes)
    h.body_width = 0  # Don't wrap lines
    markdown_content = h.handle(html_content)
    return markdown_content

def generate_front_matter(title, creation_date, update_date, slug, space_key):
    # Adjust for Hugo or Jekyll requirements
    # For Hugo:
    # ---
    # title: "My Confluence Page Title"
    # date: 2023-10-27T10:00:00Z
    # lastmod: 2023-10-27T14:30:00Z
    # draft: false
    # tags: ["confluence", "migration", "devops"]
    # categories: ["documentation", "tech"]
    # ---
    #
    # For Jekyll:
    # ---
    # layout: post
    # title: "My Confluence Page Title"
    # date: 2023-10-27 10:00:00 +0000
    # categories: [documentation, tech]
    # tags: [confluence, migration, devops]
    # ---

    def to_iso_utc(value):
        # Dates often come as "2023-10-27T14:30:00.000Z" from Confluence.
        # Drop fractional seconds and any trailing "Z" before parsing
        # (this assumes the timestamp is already in UTC).
        value = value.split('.')[0].rstrip('Z')
        return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S").isoformat() + "Z"

    created = to_iso_utc(creation_date)
    updated = to_iso_utc(update_date)
    # Escape backslashes and double quotes so the YAML string stays valid.
    # Done outside the f-strings because backslashes inside f-string
    # expressions are a syntax error before Python 3.12.
    safe_title = title.replace('\\', '\\\\').replace('"', '\\"')
    lines = [
        "---",
        f'title: "{safe_title}"',
        f"date: {created}",
        f"lastmod: {updated}",
        "draft: false",
        f'categories: ["{space_key.lower()}"]',
        'tags: ["confluence", "migration"]',
        "---",
        "",
    ]
    return "\n".join(lines)

# --- Main Logic ---
def main():
    for space_key in CONFLUENCE_SPACE_KEYS:
        if not space_key:
            continue
        print(f"Processing space: {space_key}")
        pages_in_space = get_confluence_pages(space_key)
        for page_summary in pages_in_space:
            page_id = page_summary['id']
            page_title = page_summary['title']
            page_type = page_summary['type']  # Usually 'page' or 'blogpost'
            if page_type != 'page':  # Skip blog posts and other content types
                print(f"  Skipping {page_type}: {page_title}")
                continue
            print(f"  Processing page: {page_title} (ID: {page_id})")
            try:
                full_page_data = get_page_content(page_id)
                html_content = full_page_data['body']['view']['value']
                # version['when'] is the last-modified timestamp. The original
                # creation date is not part of this expand; getting it would need
                # an extra call (e.g. /rest/api/content/{id}/history for the
                # first version). For this tutorial we use version['when'] for
                # both 'date' and 'lastmod'.
                last_modified = full_page_data['version']['when']
                markdown_content = convert_html_to_markdown(html_content)
                slug = slugify(page_title)
                filename = os.path.join(OUTPUT_DIR, f"{slug}.md")
                front_matter = generate_front_matter(
                    page_title,
                    last_modified,  # used as 'date'
                    last_modified,  # used as 'lastmod'
                    slug,
                    space_key
                )
                with open(filename, "w", encoding="utf-8") as f:
                    f.write(front_matter)
                    f.write(markdown_content)
                print(f"  Saved '{page_title}' to {filename}")
            except requests.exceptions.RequestException as e:
                print(f"  Error fetching page {page_id} ({page_title}): {e}")
            except Exception as e:
                print(f"  An unexpected error occurred for page {page_id} ({page_title}): {e}")

if __name__ == "__main__":
    main()
Logic Explanation:
- The script starts by loading your Confluence URL, email, API token, and desired space keys from environment variables for security.
- get_confluence_pages paginates through all pages within a specified Confluence space using the v1 REST API endpoint /rest/api/content, expanding body.view to get the rendered HTML content and version for modification dates.
- get_page_content fetches the full content of a specific page ID.
- convert_html_to_markdown utilizes the html2text library to transform the fetched HTML into Markdown. We configure it to retain images as HTML img tags, which often integrate better with static site generators.
- generate_front_matter creates the YAML front matter expected by Hugo or Jekyll, including title, publication date (date), last modification date (lastmod), and categories/tags derived from the Confluence space.
- The main function iterates through each specified Confluence space, fetches its pages, converts them, and saves them to individual .md files in the markdown_output directory.
- The slugify function ensures filenames are clean and URL-friendly.
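The pagination loop is the part most worth testing in isolation. Here is a minimal sketch of the same start/limit pattern, run against an in-memory stand-in instead of the real API. The names iter_results, fake_fetch, and FAKE_PAGES are hypothetical, and advancing by the number of results actually returned (rather than by limit) is a defensive choice of this sketch, not something the script above does.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int, int], dict], limit: int = 25) -> Iterator[dict]:
    """Yield every result, requesting pages until no 'next' link is returned."""
    start = 0
    while True:
        data = fetch_page(start, limit)
        results = data["results"]
        yield from results
        # Stop when the server reports no further page (or returns nothing).
        if not results or "next" not in data.get("_links", {}):
            break
        # Advance by what was actually returned, in case the server caps the limit.
        start += len(results)


# Hypothetical in-memory stand-in for the Confluence API (60 fake pages).
FAKE_PAGES = [{"id": str(i), "title": f"Page {i}"} for i in range(60)]


def fake_fetch(start: int, limit: int) -> dict:
    chunk = FAKE_PAGES[start:start + limit]
    links = {"next": "placeholder"} if start + limit < len(FAKE_PAGES) else {}
    return {"results": chunk, "_links": links}


all_pages = list(iter_results(fake_fetch))
print(len(all_pages))  # 60
```

In the real script, fetch_page would wrap the requests.get call and return response.json().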
3.2. Run the Script
With your environment variables set and the script ready, execute it from your terminal:
python3 migrate.py
You should see output indicating pages being processed, and a new markdown_output directory will be populated with your Confluence content in Markdown format.
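One way to spot-check the run is to confirm that every generated file begins with a front matter block. This is a small sketch under the assumption that files live in markdown_output and use the "---" delimiters shown earlier; has_front_matter and files_missing_front_matter are illustrative names.

```python
from pathlib import Path


def has_front_matter(text: str) -> bool:
    """Return True if the text opens with a '---' delimited front matter block."""
    lines = text.splitlines()
    return len(lines) > 2 and lines[0] == "---" and "---" in lines[1:]


def files_missing_front_matter(directory: str = "markdown_output") -> list:
    """List .md files in the directory that do not start with front matter."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return [
        path.name
        for path in sorted(root.glob("*.md"))
        if not has_front_matter(path.read_text(encoding="utf-8"))
    ]


print(files_missing_front_matter())
```

An empty list means every file in the directory passed the check (or the directory does not exist yet).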
Step 4: Integrate with Hugo/Jekyll
The final step is to incorporate the generated Markdown files into your static site generator project.
- Hugo: Copy the .md files from markdown_output into your Hugo project's content/posts directory (or any other content section you prefer).
- Jekyll: Place the .md files into your Jekyll project's _posts directory. Remember Jekyll often expects filenames in the format YYYY-MM-DD-title.md. You might need to adjust the slugify and filename generation in the script to prepend the date.
After placing the files, run your static site generator's local server (e.g., hugo server or bundle exec jekyll serve) to preview the migrated content and make any necessary style or formatting adjustments.
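For the Jekyll naming convention, a date-prefixed filename can be derived from the page title and a Confluence timestamp. The sketch below repeats the tutorial's slugify logic so it runs on its own; jekyll_filename is a hypothetical helper, and it assumes the timestamp starts with a YYYY-MM-DD date as in the examples above.

```python
import re
from datetime import datetime


def slugify(text: str) -> str:
    """Same approach as the tutorial's slugify helper."""
    text = re.sub(r"[^a-z0-9\s-]", "", text.lower())
    return re.sub(r"[\s-]+", "-", text).strip("-")


def jekyll_filename(title: str, confluence_when: str) -> str:
    """Build 'YYYY-MM-DD-title.md' from a title and a Confluence timestamp."""
    # Parse only the date part of e.g. "2023-10-27T14:30:00.000Z" (also validates it).
    date_part = datetime.strptime(confluence_when[:10], "%Y-%m-%d").date()
    return f"{date_part.isoformat()}-{slugify(title)}.md"


print(jekyll_filename("Deploying with Ansible: A Guide!", "2023-10-27T14:30:00.000Z"))
# prints: 2023-10-27-deploying-with-ansible-a-guide.md
```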
Common Pitfalls
- API Rate Limiting: Confluence Cloud APIs have rate limits. If you're migrating a very large number of pages, you might hit these limits, resulting in 429 Too Many Requests errors. Implement a retry mechanism with exponential backoff if this becomes an issue.
- Complex HTML Conversion: While html2text is good, Confluence's rich editor can generate highly complex HTML, including embedded macros, custom CSS, or specific table structures that might not convert perfectly to Markdown. Manual review and post-conversion cleanup of the Markdown files may be necessary, especially for heavily formatted pages.
- Missing Attachments/Images: This script focuses on text content. Embedded images are converted to their <img> tags, but the images themselves are not downloaded. A more advanced script would need to identify image URLs, download them, and update Markdown references to point to local assets.
- Authentication Errors: Double-check your CONFLUENCE_URL, CONFLUENCE_EMAIL, and CONFLUENCE_API_TOKEN values. A common mistake is using your regular password instead of an API token, or having typos in the URL.
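For the rate-limiting pitfall, a retry loop with exponential backoff might look like the following. This is a sketch, not a drop-in fix: call_with_backoff and FakeResponse are hypothetical names, the request is injected as a callable so the logic can be exercised without network access, and honoring a numeric Retry-After header when the server sends one is an assumption about the response.

```python
import time


def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry on HTTP 429, doubling the wait after each attempt."""
    for attempt in range(max_retries + 1):
        response = send()
        # Return on any non-429 status, or when retries are exhausted.
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Prefer the server's Retry-After hint (in seconds) when it is numeric.
        retry_after = response.headers.get("Retry-After", "")
        delay = float(retry_after) if retry_after.isdigit() else base_delay * 2 ** attempt
        sleep(delay)


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for demonstration only."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


responses = iter([FakeResponse(429), FakeResponse(429, {"Retry-After": "7"}), FakeResponse(200)])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result.status_code, delays)  # 200 [1.0, 7.0]
```

In the migration script, send could be something like lambda: requests.get(url, headers=HEADERS, auth=AUTH), followed by response.raise_for_status() on the returned object.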
Conclusion
You've successfully automated the often daunting task of migrating Confluence pages to Markdown. This process empowers you to take control of your content, making it portable and future-proof. By leveraging static site generators, you gain benefits like improved performance, enhanced security, simplified hosting, and seamless integration with modern DevOps workflows.
Consider this script a solid foundation. Next steps could involve:
- Automating image and attachment migration.
- Implementing more sophisticated error handling and logging.
- Adding support for Confluence blog posts or other content types.
- Creating a continuous integration pipeline to periodically sync changes from Confluence to your static site.
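As a starting point for the image migration step, the first task is finding the image URLs in a page's HTML body. The sketch below uses only the standard library; ImageCollector is an illustrative name, and the sample HTML and attachment path are made up for the example rather than taken from a real Confluence response. Downloading the files and rewriting the references is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs of all <img> tags found in an HTML fragment."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the Confluence base URL.
                self.image_urls.append(urljoin(self.base_url, src))


# Hypothetical HTML body and attachment path, for illustration only.
collector = ImageCollector("https://your-domain.atlassian.net/wiki/")
collector.feed('<p>Diagram:</p><img src="/wiki/download/attachments/123/arch.png" alt="arch">')
print(collector.image_urls)
```

Each collected URL could then be fetched with the same AUTH credentials and saved next to the Markdown file.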
Embrace the power of automation and keep your documentation agile and accessible!
Read the original article on TechResolve.blog