How Markdown Parsers Actually Work Under the Hood

Markdown to HTML conversion looks simple until you try to build a parser. The original Markdown specification by John Gruber is a 3,500-word document with enough ambiguity to produce dozens of incompatible implementations. Understanding how parsers work helps you write markdown that renders consistently everywhere.

The parsing pipeline

Most markdown parsers follow roughly the same three-stage architecture:

1. Lexing/Tokenizing - Break the input into tokens (headings, paragraphs, code blocks, lists, etc.)
2. Parsing - Build a tree structure from the tokens
3. Rendering - Walk the tree and output HTML

The simplest possible markdown-to-HTML converter for a single feature:

function headingsToHtml(markdown) {
  return markdown.replace(/^(#{1,6})\s+(.+)$/gm, (match, hashes, text) => {
    const level = hashes.length;
    return `<h${level}>${text}</h${level}>`;
  });
}

headingsToHtml("# Hello\n## World");
// "<h1>Hello</h1>\n<h2>World</h2>"

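This regex approach works for headings in isolation but breaks down as soon as you need to handle nested structures, paragraph detection, and the interactions between different block-level elements. For example, it would happily convert a line starting with # inside a fenced code block into a heading.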
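As a rough illustration of the three pipeline stages described earlier, here is a minimal sketch that handles only headings and paragraphs. The function names (tokenize, parse, render) and the token and node shapes are assumptions made for this example, not the API of any particular library, and the output is not HTML-escaped.

```javascript
// Stage 1: tokenize. Classify each line as a heading, blank, or text token.
function tokenize(markdown) {
  return markdown.split('\n').map((line) => {
    const heading = /^(#{1,6})\s+(.+)$/.exec(line);
    if (heading) {
      return { type: 'heading', level: heading[1].length, text: heading[2] };
    }
    if (/^\s*$/.test(line)) return { type: 'blank' };
    return { type: 'text', text: line };
  });
}

// Stage 2: parse. Group tokens into a tree; consecutive text lines
// merge into a single paragraph node.
function parse(tokens) {
  const root = { type: 'document', children: [] };
  let paragraph = null;
  for (const token of tokens) {
    if (token.type === 'text') {
      if (!paragraph) {
        paragraph = { type: 'paragraph', lines: [] };
        root.children.push(paragraph);
      }
      paragraph.lines.push(token.text);
    } else {
      paragraph = null; // headings and blank lines end the current paragraph
      if (token.type === 'heading') root.children.push(token);
    }
  }
  return root;
}

// Stage 3: render. Walk the tree and emit HTML for each node.
// Note: text is not HTML-escaped in this sketch.
function render(node) {
  switch (node.type) {
    case 'document':
      return node.children.map(render).join('\n');
    case 'heading':
      return `<h${node.level}>${node.text}</h${node.level}>`;
    case 'paragraph':
      return `<p>${node.lines.join('\n')}</p>`;
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const pipelineHtml = render(parse(tokenize('# Title\n\nfirst line\nsecond line')));
console.log(pipelineHtml);
// <h1>Title</h1>
// <p>first line
// second line</p>
```

Keeping the stages separate means paragraph detection lives in the parse step, where it can see neighboring tokens, rather than being squeezed into a single regex.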

Block-level parsing

Block-level elements are the top-level structures: headings, paragraphs, code blocks, lists, blockquotes, and horizontal rules. The parser processes the document line by line, identifying which block type each line belongs to.

function identifyBlock(line) {
  // Classify a single line by its leading marker.
  if (/^#{1,6}\s/.test(line)) return 'heading';
  if (/^```/.test(line)) return 'code_fence';
  if (/^>\s/.test(line)) return 'blockquote';
  if (/^[-*+]\s/.test(line)) return 'unordered_list';
  if (/^\d+\.\s/.test(line)) return 'ordered_list';
  if (/^---$|^\*\*\*$|^___$/.test(line)) return 'hr';
  if (/^\s*$/.test(line)) return 'blank';
  return 'paragraph';
}

The complexity comes from multi-line blocks. A code block continues until the closing fence. A list continues until a blank line followed by a non-list item. Blockquotes can contain other block elements. These interactions create the nested structure that requires a proper parser, not just regex.
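One way to picture multi-line handling is a small state machine that remembers whether it is currently inside a fence. The sketch below is a simplified assumption of how that might look: splitBlocks and its block shapes are hypothetical names for this example, and it covers only code fences, headings, and paragraphs.

```javascript
const FENCE = '`'.repeat(3); // three backticks, built this way to keep the sample readable

// Hypothetical helper: groups lines into blocks while tracking fence state.
function splitBlocks(markdown) {
  const blocks = [];
  let current = null; // the paragraph or code block being accumulated
  let inFence = false;

  for (const line of markdown.split('\n')) {
    const isFence = /^`{3}/.test(line);

    if (inFence) {
      if (isFence) {
        inFence = false; // closing fence ends the code block
        current = null;
      } else {
        current.lines.push(line); // everything inside a fence is kept verbatim
      }
      continue;
    }

    if (isFence) {
      inFence = true;
      current = { type: 'code', lines: [] };
      blocks.push(current);
    } else if (/^\s*$/.test(line)) {
      current = null; // a blank line ends the current paragraph
    } else if (/^#{1,6}\s/.test(line)) {
      blocks.push({ type: 'heading', lines: [line] });
      current = null;
    } else {
      if (!current) {
        current = { type: 'paragraph', lines: [] };
        blocks.push(current);
      }
      current.lines.push(line);
    }
  }
  return blocks;
}

const sampleBlocks = splitBlocks(
  ['# Title', FENCE, '# not a heading', FENCE, 'para one', 'para two'].join('\n')
);
console.log(sampleBlocks.map((b) => b.type));
// [ 'heading', 'code', 'paragraph' ]
```

The line "# not a heading" ends up inside the code block because the fence state is checked before any other pattern, which is exactly what a per-line regex cannot do.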

Inline parsing

After block-level parsing, each block's content is parsed for inline elements: bold, italic, code, links, and images.


function parseInline(text) {
  // Order matters: bold before italic, images before links.
  return text
    .replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>')
    .replace(/\*(.+?)\*/g, '<em>$1</em>')
    .replace(/`(.+?)`/g, '<code>$1</code>')
    .replace(/!\[(.+?)\]\((.+?)\)/g, '<img alt="$1" src="$2">')
    .replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');
}

The order of these replacements matters. Bold (**) must be processed before italic (*), or the italic pattern consumes the asterisks first and **bold** comes out as <em>*bold</em>* instead of <strong>bold</strong>. Images (![]()) must be processed before regular links ([]()), or the link pattern matches first and the ! is left behind as a literal character.
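The effect of ordering is easy to check by isolating each replacement and applying the passes in both orders. The pass names below (boldPass, italicPass, imagePass, linkPass) are made up for this demonstration; the regexes are the same ones used in parseInline.

```javascript
// Each pass is one replacement from parseInline, isolated so the order
// can be swapped.
const boldPass = (s) => s.replace(/\*\*(.+?)\*\*/g, '<strong>$1</strong>');
const italicPass = (s) => s.replace(/\*(.+?)\*/g, '<em>$1</em>');
const imagePass = (s) => s.replace(/!\[(.+?)\]\((.+?)\)/g, '<img alt="$1" src="$2">');
const linkPass = (s) => s.replace(/\[(.+?)\]\((.+?)\)/g, '<a href="$2">$1</a>');

// Bold first, then italic: the intended result.
console.log(italicPass(boldPass('**bold**')));
// <strong>bold</strong>

// Italic first: the single-asterisk pattern swallows part of the bold marker.
console.log(boldPass(italicPass('**bold**')));
// <em>*bold</em>*

// Image first, then link: the intended result.
console.log(linkPass(imagePass('![logo](logo.png)')));
// <img alt="logo" src="logo.png">

// Link first: the "!" is stranded in front of an anchor tag.
console.log(imagePass(linkPass('![logo](logo.png)')));
// !<a href="logo.png">logo</a>
```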

The ambiguity problem

Consider this markdown:


* item 1
* item 2

* item 3

Is this one list or two? The original Markdown spec says a blank line between list items creates "loose" list items (wrapped in <p> tags), which implies one list. But some parsers treat it as two separate lists. CommonMark, the most rigorous specification, has detailed rules for this case and treats it as a single loose list, but not every parser follows them.
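To make the "loose" rule concrete, here is a small sketch of the one-list interpretation. It assumes the input contains only single-line bullet items and blank lines, and renderFlatList is a hypothetical name for this example, not a function from any real parser.

```javascript
// Renders a flat bullet list. If any two items are separated by a blank
// line, the whole list is treated as loose and every item gets a <p> wrapper.
function renderFlatList(markdown) {
  const items = [];
  let loose = false;
  let sawBlank = false;

  for (const line of markdown.split('\n')) {
    const bullet = /^[-*+]\s+(.*)$/.exec(line);
    if (bullet) {
      // A blank line only counts when it sits between two items.
      if (sawBlank && items.length > 0) loose = true;
      items.push(bullet[1]);
      sawBlank = false;
    } else if (/^\s*$/.test(line)) {
      sawBlank = true;
    }
    // Any other line is ignored in this simplified sketch.
  }

  const body = items
    .map((text) => (loose ? `<li><p>${text}</p></li>` : `<li>${text}</li>`))
    .join('\n');
  return `<ul>\n${body}\n</ul>`;
}

console.log(renderFlatList('* item 1\n* item 2\n\n* item 3'));
// <ul>
// <li><p>item 1</p></li>
// <li><p>item 2</p></li>
// <li><p>item 3</p></li>
// </ul>
```

Note that a single blank line makes every item loose, not just the ones next to it, which is why the looseness flag belongs to the list rather than to individual items.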

Another classic ambiguity:


> blockquote
continued line

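Does "continued line" belong to the blockquote or start a new paragraph? Gruber's implementation includes it in the blockquote, and so does CommonMark, which calls this a "lazy continuation line": once a paragraph has started inside a blockquote, following lines may omit the > prefix. Some other parsers start a new paragraph instead, so repeating the > on every line is the portable choice.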
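A parser that accepts unprefixed continuation lines might collect a blockquote roughly like this. This is a simplified sketch under stated assumptions: collectBlockquote is a hypothetical helper, and it treats every non-blank line as plain paragraph text, whereas a real parser would also check that the line does not start a new block such as a heading or list.

```javascript
// Collects the lines of a blockquote starting at the top of `lines`.
// Returns the quoted content (with markers removed) and the remaining lines.
function collectBlockquote(lines) {
  const quoted = [];
  let i = 0;

  while (i < lines.length) {
    const line = lines[i];
    const marked = /^>\s?(.*)$/.exec(line);
    const lastQuoted = quoted[quoted.length - 1];

    if (marked) {
      quoted.push(marked[1]);
    } else if (quoted.length > 0 && lastQuoted !== '' && !/^\s*$/.test(line)) {
      // Unprefixed continuation: accepted only while a paragraph is open.
      quoted.push(line);
    } else {
      break; // a blank line, or nothing to continue, ends the blockquote
    }
    i += 1;
  }

  return { quoted, rest: lines.slice(i) };
}

console.log(collectBlockquote(['> blockquote', 'continued line', '', 'next paragraph']));
// { quoted: [ 'blockquote', 'continued line' ], rest: [ '', 'next paragraph' ] }
```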

CommonMark and GFM

CommonMark is an attempt to create an unambiguous Markdown specification. Its spec includes over 600 examples that double as conformance tests for edge cases the original description left open. GitHub Flavored Markdown (GFM) extends CommonMark with tables, task lists, strikethrough, and autolinks.

If you are building anything that processes markdown, target CommonMark compliance. The spec is published at commonmark.org, and the project maintains reference implementations in both JavaScript (commonmark.js) and C (cmark).

Security considerations

Converting user-provided markdown to HTML can create XSS (Cross-Site Scripting) vulnerabilities. Markdown allows raw HTML, so a user can inject:


<script>document.cookie</script>

Every markdown-to-HTML converter used in a web application must neutralize this kind of input. There are two approaches:

1. Strip all HTML tags from the markdown before parsing
2. Sanitize the HTML output using a library like DOMPurify

Option 2 is preferred because it allows legitimate HTML like <details> and <summary> while blocking dangerous elements.
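Raw tags are not the only thing a sanitizer has to catch: a link destination such as javascript:alert(1) passes through a naive renderer untouched. The sketch below illustrates the allowlist idea behind output sanitization. It is not a substitute for a maintained library like DOMPurify, and the helper names (escapeHtml, safeHref, renderLink) and the list of allowed schemes are assumptions for this example.

```javascript
// Escape characters that are significant in HTML text and attribute values.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Assumed allowlist; a real application would choose its own.
const ALLOWED_SCHEMES = ['http:', 'https:', 'mailto:'];

// Returns the URL if it is relative or uses an allowed scheme, else "#".
function safeHref(url) {
  // Remove whitespace and control characters before looking for a scheme,
  // so inputs like "java\tscript:" are still recognized.
  const cleaned = url.replace(/[\u0000-\u0020]/g, '');
  const scheme = /^([a-z][a-z0-9+.-]*:)/i.exec(cleaned);
  if (scheme && !ALLOWED_SCHEMES.includes(scheme[1].toLowerCase())) {
    return '#';
  }
  return url;
}

// Render a link with both the destination and the label made safe.
function renderLink(text, url) {
  return `<a href="${escapeHtml(safeHref(url))}">${escapeHtml(text)}</a>`;
}

console.log(renderLink('click me', 'javascript:alert(1)'));
// <a href="#">click me</a>
console.log(renderLink('docs', 'https://example.com/?a=1&b=2'));
// <a href="https://example.com/?a=1&amp;b=2">docs</a>
```

An allowlist is the safer default here: blocking only known-bad schemes leaves the renderer exposed to any scheme nobody thought to list.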

For quick markdown-to-HTML conversion with proper sanitization, I built a converter at zovo.one/free-tools/markdown-to-html. It handles CommonMark plus GFM extensions and shows the rendered output alongside the HTML source. Useful for previewing markdown before committing it and for generating HTML snippets from markdown sources.

I'm Michael Lip. I build free developer tools at zovo.one. 500+ tools, all private, all free.