DevTip#3: Strip Down and Convert HTML to Markdown for Importing Notes into Joplin

#tutorial #techtip #tooling #productivity

▶ Background

🛑 Problem Statement

📜 Solution

1️⃣ Step 1

2️⃣ Step 2

✉ Summary

▶ Background

Although this article specifically mentions importing notes into Joplin, the steps laid out are generic. Most of the details of exporting and importing notes are left out to keep the article as concise as possible. If you are just looking for the steps to clean and convert HTML to Markdown, skip straight to the Solution.

I have used OneNote, Evernote and Leanote as my note-taking software. OneNote was primarily for personal notes, while Evernote was for work. Later on, I switched from Evernote to self-hosted Leanote and have used that for the past 2-3 years.

Now I plan to consolidate the contents of both OneNote and Leanote in one place, and in the end I settled on Joplin. This article focuses on one part of the whole export and import process. More details are laid out in the Solution section below.

🛑 Problem Statement

A quick web search will return results on how to import OneNote and Leanote notes into Joplin. Although the methods from the search results work reasonably well for simply formatted notes, the imported notes look messed up once they contain tables, images, styles, nested structures, or a combination of these.

📜 Solution

After some trial and error, the following workflow worked best:

export all the notes from OneNote and Leanote into HTML,
remove all HTML attributes except for src and href (kept for image sources and links),
convert the cleaned HTML into Markdown,
and finally import all the Markdown notes into Joplin using the built-in importer.

This article will focus on the two intermediate steps: sanitizing the HTML and converting it into Markdown.
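As a rough sketch of how these two intermediate steps might be chained over a folder of exported files, the helper below reads each HTML file, sanitizes it, writes the cleaned copy, and hands it to a converter. The name `process_exports`, the flat folder layout, and the UTF-8 encoding are assumptions made for illustration; `sanitize` and `convert` are placeholders for whatever functions perform Step 1 and Step 2.

```python
from pathlib import Path


def process_exports(src_dir, dst_dir, sanitize, convert):
    """Sanitize every .html file in src_dir and convert it to .md in dst_dir.

    sanitize: callable taking an HTML string and returning cleaned HTML
    convert:  callable taking (html_path, md_path) and writing the Markdown file
    Both are supplied by the caller; this helper is hypothetical glue code.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(src.glob("*.html")):
        # Assumption: the exported files are UTF-8 encoded.
        clean_html = sanitize(html_path.read_text(encoding="utf-8"))
        clean_path = dst / html_path.name
        clean_path.write_text(clean_html, encoding="utf-8")
        md_path = clean_path.with_suffix(".md")
        convert(clean_path, md_path)
        written.append(md_path)
    return written
```

Passing the two steps in as callables keeps the file handling separate from the HTML cleaning and the Markdown conversion, so each piece can be tried on a single note before running the whole folder.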

1️⃣ Step 1

A typical HTML document contains script tags, id, class, style and data attributes, and many other tags and attributes. As most of these are not supported in Markdown, removing them all and keeping only the handful we need will drastically improve the Markdown output later on.

Beautiful Soup 4 is used for this step. The command below uses pip to install the Python library.

pip install beautifulsoup4

After installing, running the code example below will remove all HTML attributes except for src and href, and remove all script and style tags. I also posted this solution in this SO Q&A.

# https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree
from bs4 import BeautifulSoup, NavigableString

def unstyle_html(html):
    soup = BeautifulSoup(html, features="html.parser")

    # remove all attributes except for `src` and `href`
    for tag in soup.descendants:
        keys = []
        if not isinstance(tag, NavigableString):
            for k in tag.attrs.keys():
                if k not in ["src", "href"]:
                    keys.append(k)
            for k in keys:
                del tag[k]

    # remove all script and style tags
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    # return html text
    return soup.prettify()
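
For comparison, the same whitelist idea can be sketched without third-party packages, using Python's built-in `html.parser`. This is not the Beautiful Soup approach above, only a minimal illustration under simplifying assumptions: it drops comments and the doctype, does not pretty-print, and the names `AttributeStripper` and `strip_html` are made up for this example.

```python
from html import escape
from html.parser import HTMLParser

KEEP_ATTRS = {"src", "href"}     # attributes to keep
DROP_TAGS = {"script", "style"}  # tags removed together with their content


class AttributeStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML with only whitelisted attributes."""

    def __init__(self):
        super().__init__()  # default convert_charrefs=True: text arrives unescaped
        self.parts = []
        self.skipping = False  # True while inside <script> or <style>

    def _open_tag(self, tag, attrs, closer=">"):
        kept = "".join(
            f' {name}="{escape(value or "", quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS
        )
        return f"<{tag}{kept}{closer}"

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skipping = True
        elif not self.skipping:
            self.parts.append(self._open_tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... />
        if tag not in DROP_TAGS and not self.skipping:
            self.parts.append(self._open_tag(tag, attrs, closer=" />"))

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skipping = False
        elif not self.skipping:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            # Re-escape text because the parser handed it over unescaped.
            self.parts.append(escape(data, quote=False))


def strip_html(html):
    parser = AttributeStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = (
    '<div class="note" id="n1"><script>alert(1)</script>'
    '<a href="https://example.com" style="color:red">link</a>'
    '<img src="a.png" data-x="1"></div>'
)
print(strip_html(sample))
# prints: <div><a href="https://example.com">link</a><img src="a.png"></div>
```

Beautiful Soup remains the more forgiving choice for messy real-world exports, since it repairs broken markup while this sketch simply echoes what the parser sees.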

2️⃣ Step 2

pandoc is used to convert the sanitized HTML into Markdown. Since pandoc is a command-line tool, an external library is installed using pip to make it easier to use from Python.

pip install pypandoc

After installing, the code snippet below shows how to call pypandoc to convert HTML into Markdown.

import pypandoc

# html_path: sanitized HTML file from Step 1; md_path: Markdown file to write
pypandoc.convert_file(html_path,
    'markdown+pipe_tables+backtick_code_blocks-markdown_attribute',
    format='html',
    outputfile=md_path)

This pandoc documentation shows all the supported input and output formats. If you are curious about the 'plus' and 'minus' strings after the format name, they enable and disable pandoc extensions respectively. The Markdown files generated with these extensions gave me the best imported Joplin notes. Check out this section for more details about the extensions.
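
To make the 'plus' and 'minus' syntax concrete, here is a small sketch that assembles such a format string from a base format and two lists of extensions. `pandoc_format` is a hypothetical helper written for illustration; it is not part of pandoc or pypandoc, and it only builds the string without checking that the extension names are valid.

```python
def pandoc_format(base, enable=(), disable=()):
    """Build a pandoc format string such as 'markdown+pipe_tables-markdown_attribute'.

    base:    the format name, e.g. 'markdown'
    enable:  extensions to switch on  (each prefixed with '+')
    disable: extensions to switch off (each prefixed with '-')
    """
    plus = "".join(f"+{name}" for name in enable)
    minus = "".join(f"-{name}" for name in disable)
    return f"{base}{plus}{minus}"


fmt = pandoc_format(
    "markdown",
    enable=["pipe_tables", "backtick_code_blocks"],
    disable=["markdown_attribute"],
)
print(fmt)
# prints: markdown+pipe_tables+backtick_code_blocks-markdown_attribute
```

Keeping the extensions in lists makes it easier to toggle one at a time and compare how the resulting notes look after import.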