Kees C. Bakker

Originally published at keestalkstech.com

Task: Save Article to Markdown

WordPress rules, but I would like my content to be on other platforms as well. Some platforms, like DEV, use Markdown, and I struggled to import my articles there. That's why I created a small application that converts an article to Markdown.

Final result

Just paste the URL of this blog into this small repl.it program and watch how it converts the article into a big string of Markdown.

Packages

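This solution uses Node.js. NPM has some great packages to work with:

  • node-fetch to fetch the HTML of the page.
  • linkedom to parse the HTML into DOM nodes.
  • node-html-markdown to convert HTML into Markdown.

Install them like this: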

npm install node-fetch@2 linkedom node-html-markdown
npm install -D @types/node-fetch

Simple scraper

We're going to do the following:

  1. Fetch the text of the URL. This is HTML, of course.
  2. Parse it to DOM nodes.
  3. Detect the article node.
  4. Convert the article node to Markdown.

This results in the following lines of code:

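import fetch from 'node-fetch'
import { parseHTML } from 'linkedom'
import { NodeHtmlMarkdown } from 'node-html-markdown'

async function scrape(url: string) {
  // 1. fetch the HTML of the page
  let f = await fetch(url)
  let txt = await f.text()

  // 2. parse it to DOM nodes
  const { document } = parseHTML(txt)

  // custom parsing (explained in the sections that follow):
  // parseCodeFields(document)
  // parseEmbeds(document)

  // 3. detect the article node, from most to least specific
  let article = (
    document.querySelector('article .entry-content') ||
    document.querySelector('article .crayons-article__main') ||
    document.querySelector('article') ||
    document.querySelector('body'))

  // 4. convert the article node to Markdown
  let html = article?.innerHTML || ""
  let content = NodeHtmlMarkdown.translate(html).trim()
  // let header = parseHeader(document)
  // content = header + content

  return content
}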
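
The chain of querySelector calls grows with every platform you want to support. As a rough sketch (not part of the original program), the same fallback could be written as an ordered list of selectors. The names Queryable, ARTICLE_SELECTORS and findArticle are made up for this example; the selectors are the ones used above.

```typescript
// Hypothetical minimal shape: anything with a querySelector method will do,
// for instance the document returned by linkedom's parseHTML.
interface Queryable<T> {
  querySelector(selector: string): T | null
}

// Ordered from most to least specific, same selectors as in scrape().
const ARTICLE_SELECTORS: string[] = [
  "article .entry-content",
  "article .crayons-article__main",
  "article",
  "body",
]

// Returns the first node that matches, or null when nothing matches.
function findArticle<T>(
  doc: Queryable<T>,
  selectors: string[] = ARTICLE_SELECTORS
): T | null {
  for (const selector of selectors) {
    const node = doc.querySelector(selector)
    if (node) return node
  }
  return null
}
```

Supporting another site then comes down to adding one selector to the list, in the right position.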

Code Language Support

Now, my WordPress generates <pre class="lang-ts"><code></code></pre> blocks. It looks like node-html-markdown only picks up the language from <pre><code class="language-ts"></code></pre>. That's easily fixed by adding some extra processing before converting the document to Markdown:

function parseCodeFields(document: Document) {
  document.querySelectorAll("pre code").forEach(code => {
    // WordPress puts the language on the <pre>, like class="lang-ts"
    let lang = [...code.parentElement?.classList || []]
      .find(x => x.startsWith("lang-"))

    if (!lang) return

    // the converter expects it on the <code>, like class="language-ts"
    lang = lang.replace("lang-", "language-")
    code.classList.add(lang)
  })
}
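
The renaming itself does not need a DOM, so it can be pulled out and checked on its own. This is a small sketch of that idea; toLanguageClass is a made-up helper name, not something from node-html-markdown or from the original program.

```typescript
// Finds the first "lang-xx" class name and returns it as "language-xx".
// Returns undefined when no class name starts with "lang-".
function toLanguageClass(classNames: string[]): string | undefined {
  const prefix = "lang-"
  const match = classNames.find(name => name.startsWith(prefix))
  if (!match) return undefined
  return "language-" + match.slice(prefix.length)
}

// Class names as they might appear on a WordPress <pre> element.
console.log(toLanguageClass(["wp-block-code", "lang-ts"])) // language-ts
```

Class names that already start with "language-" are left alone, because they do not start with "lang-".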

Embed rich content

Fortunately, dev.to supports liquid tags to embed rich content like repl.it and tweets. Let's parse our iframe elements into a liquid tag:

function parseEmbeds(document: Document) {
  document.querySelectorAll('iframe').forEach(iframe => {
    if (!iframe.src) return

    const url = new URL(iframe.src)
    const type = url.host
    const name = url.pathname

    const p = document.createElement("p")
    const n = document.createTextNode(`{% ${type} ${name} %}`)
    p.appendChild(n)

    iframe.parentNode?.insertBefore(p, iframe)
  })
}

This will not work for every embed, but it will get you started.
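
One way to cover more embeds might be to translate the host of the iframe into a tag name through a small lookup table. The sketch below is a suggestion, not what the program does today: liquidTagFor and the aliases parameter are made up for this example, and the "replit" alias in the usage line is an assumption that should be checked against the DEV editor guide.

```typescript
// Builds a liquid tag from an iframe URL.
// aliases maps a host name to a tag name (hypothetical, supplied by the caller).
// Hosts without an alias keep the host name as the tag, like parseEmbeds does.
function liquidTagFor(
  src: string,
  aliases: Record<string, string> = {}
): string | null {
  let url: URL
  try {
    url = new URL(src)
  } catch {
    return null // not an absolute URL, so there is nothing to embed
  }

  const type = aliases[url.host] ?? url.host
  return `{% ${type} ${url.pathname} %}`
}

// The query string is dropped; only host and path end up in the tag.
console.log(liquidTagFor("https://repl.it/@user/demo?lite=true", { "repl.it": "replit" }))
```

Platforms that expect an ID instead of a path would still need their own handling.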

Header support

To be complete, we also need to add a YAML header with the title, tags and the canonical URL. It requires some parsing, but it'll make things easier:

function parseHeader(document: Document) {

  let header = '---\n'

  let title = (document.querySelector('h1')?.textContent || '').trim()
  if (title) {
    header += `title: ${title}\n`
  }

  let tags = [...document.querySelectorAll(".categories a, .tags a")]
    .map(a => (a.textContent || '').trim().toLowerCase())
    .filter(t => t)
  if (tags.length > 0) {
    tags.sort()
    let t = [... new Set(tags)].join(", ")
    header += `tags: [${t}]\n`
  }

  let canonical = document.querySelector('link[rel=canonical]')?.getAttribute("href")
  if (canonical) {
    header += `canonical_url: ${canonical}\n`
  }

  header += '---\n\n'

  return header;
}
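
A title with a colon in it, like the title of this article, can trip up a YAML parser when it is written without quotes. A possible way around that, sketched here under the assumption that the platform accepts double-quoted YAML strings, is to run the title through JSON.stringify. The names HeaderFields and buildHeader are made up for this example.

```typescript
// Hypothetical input shape: the values parseHeader reads from the document.
interface HeaderFields {
  title?: string
  tags?: string[]
  canonical?: string
}

function buildHeader({ title, tags = [], canonical }: HeaderFields): string {
  const lines: string[] = []

  // JSON.stringify produces a double-quoted string with escaped quotes,
  // which is also a valid YAML double-quoted scalar.
  if (title) lines.push(`title: ${JSON.stringify(title)}`)

  // Remove duplicates and sort, like the original does.
  const unique = Array.from(new Set(tags)).sort()
  if (unique.length > 0) lines.push(`tags: [${unique.join(", ")}]`)

  if (canonical) lines.push(`canonical_url: ${canonical}`)

  return ["---", ...lines, "---", "", ""].join("\n")
}

console.log(buildHeader({
  title: "Task: Save Article to Markdown",
  tags: ["node", "markdown", "node"],
}))
```

The rest of the header keeps the same layout as the output of parseHeader.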

Final thoughts

I still need to find a better way to detect the language of code snippets, so I don't have to add them by hand. When I look at the result, I know one thing for sure: I'll keep using WordPress to write my blogs, as Markdown does not make it more readable!

Oh, and when you read this post on dev.to: it was created using this code (and yes, that's super meta 🤓).

Changelog

  • 2022-08-31 Initial article.
  • 2022-01-09 Fixed language support for WordPress code fields (see Code Language Support).
  • 2022-03-09 Fixed embedding of repl.it through iframe parsing (see Embed rich content).
  • 2022-03-09 Added title support.
  • 2022-03-09 Added YAML header support (see Header support).
