Kinanee Samson
Build a markdown parser from scratch (Tokenizer)

How many of you have ever had the desire to build a parser from scratch? Since you started using your favorite programming language, it must have occurred to you at some point. You have asked yourself countless times: how is this jargon I'm writing being compiled or interpreted?

You may know a thing or two about compilers and interpreters, but you haven't taken the time to implement or build one of your own. Well, look no further, because in today's post I will take my time to teach you how to make a parser for the Markdown markup language.

In this post, you'll discover numerous amazing and useful concepts that will 100% make you a 10X developer; heck, you might even create your own programming language after you're done. You will learn things like:

  • Regular Expressions (RegEx) for Pattern Matching.
  • Object-Oriented Programming (OOP)
  • Fundamentals of a Compiler/Interpreter Pipeline (Lexical Analysis (Tokenization), Syntactic Analysis (Parsing), and Code Generation (Compilation))
  • Recursive Parsing and Compilation

If you prefer watching video content, then you can watch the video on YouTube.

Let's create a type TokenType, which will be a union of string literals: HEADING, PARAGRAPH, NEWLINE, TEXT, and SPACE, plus the block and inline token types we will match later (LINK, IMAGE, LISTITEM, BOLD, and ITALIC). Next, I will create an interface Token that will serve as the interface for valid tokens. It will have a type property, which will be a TokenType, and a value property, which is a string. It will also have an optional level property, which is a number; an optional alt property, which is a string; an optional url property, which is a string; and an optional indent property, which is a number. This interface represents an identified token.

export type TokenType =
  | "HEADING"
  | "PARAGRAPH"
  | "NEWLINE"
  | "TEXT"
  | "SPACE"
  | "LINK"
  | "IMAGE"
  | "LISTITEM"
  | "BOLD"
  | "ITALIC";

export interface Token {
  type: TokenType;
  value: string;
  level?: number;
  alt?: string;
  url?: string;
  indent?: number;
}
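
To make the shape concrete, here are two token objects that satisfy this interface; they are the same kind of values our tokenizer will emit later in this post:

const headingToken: Token = { type: "HEADING", value: "Hello World", level: 2 };
const newlineToken: Token = { type: "NEWLINE", value: "\n" };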


Next, I will define an interface ASTNode, which will represent an identified node. It will have a type property, which is a TokenType; an optional children property, which is an array of ASTNode; an optional value property, which is a string; an optional level property, which is a number; an optional alt property, which is a string; an optional url property, which is a string; an optional indent property, which is a number; and finally, an optional lang property, which is also a string.

export interface ASTNode {
  type: TokenType;
  children?: ASTNode[];
  value?: string;
  level?: number;
  alt?: string;
  url?: string;
  indent?: number;
  lang?: string;
}
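
The parser, which we will build in the next part, will assemble these nodes into a tree. As a rough sketch of where we are headed (this is an assumption about the eventual parser output, not something we build in this post), a level-two heading might end up represented like this:

const headingNode: ASTNode = {
  type: "HEADING",
  level: 2,
  children: [{ type: "TEXT", value: "Hello World" }],
};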

We need to implement a tokenizer; it is the job of the tokenizer to go over the text string we want to parse and identify meaningful tokens inside it. Imagine you have a long piece of text, like a sentence or a paragraph. The tokenizer is like a specialized reader that goes through this text and breaks it down into individual,
meaningful "words" or "chunks." These chunks aren't English words; rather,
they're defined by specific rules. In programming, we call these chunks tokens.

Algorithm for Tokenizer

  • Start with an empty list called tokens (this is where we'll store all the identified tokens).
  • Set a position pointer to the beginning of the input string.
  • Create a predefined list of patterns for the types of elements you want to recognize.
  • Each pattern consists of:
    • A regular expression (regex): the search rule to find the element.
    • A token type: what to call this element if it's found (e.g., "HEADING", "LINK", "TEXT").
    • An optional handler function: a small piece of code that will extract specific details from the matched text (like the heading level or a link's URL) and put them into the token.
  • Keep looping as long as the position pointer hasn't reached the end of the input string.
  • Inside the loop, for the current position, try to match each predefined pattern (regex) against the remaining part of the input string, starting exactly at the position.
  • Reset the regex's internal lastIndex before each attempt to ensure it always starts matching from the correct spot.
  • If a pattern successfully matches at the current position
    • Get the full text that was matched.
    • Use the optional handler function (if it exists) to extract any special data for this type of token (e.g., level for headings, url for links, alt for images). If no handler, just use the matched text as the token's value.
    • Create a new Token object with the correct type and the extracted data.
    • Add this new Token to your tokens list.
    • Advance the position pointer by the length of the matched text, so the next search starts right after the token you just found.
    • Stop checking other patterns for this position and go back to the top of the while loop to continue from the new position.
  • If no pattern matches at the current position:
    • This means the character (or sequence of characters) isn't recognized by any of your predefined rules.
    • For the tokenizeInline method, it typically creates a "TEXT" token for the single character it couldn't match.
    • Advance the position pointer by one character, effectively skipping over the unrecognized part (or treating it as plain text if that's the fallback behavior).
  • Once the loop finishes (meaning you've gone through the entire input string), return the tokens list. This list is now ready for the next stage: the parser.

Here is the block-level tokenizer that implements this algorithm:

class Tokenizer {
  tokenize(input: string): Token[] {
    const tokens: Token[] = [];

    let pos = 0;

    const blockPatterns: [RegExp, TokenType, Function?][] = [
      // * HEADING
      [
        /^#{1,6}\s[^\n]*(?:\n|$)/,
        "HEADING",
        (match: string) => ({
          level: match.trim().split(" ")[0].length,
          value: match.replace(/^#+\s/, "").trim(),
        }),
      ],
      //*  NEWLINE
      [/\n+/, "NEWLINE"],

      // * TEXT
      [/^(?!#|!|\[|-)[^\n]+/, "TEXT", (match: string) => ({ value: match })],

      // * Links
      [
        /^\[([^\]]+)\]\(([^)]+)\)/,
        "LINK",
        (match: string, text: string, url: string) => ({
          value: text,
          url,
        }),
      ],
      // * Images
      [
        /^!\[([^\]]+)\]\(([^)]+)\)/,
        "IMAGE",
        (match: string, alt: string, url: string) => ({
          alt,
          url,
          value: match,
        }),
      ],

      // * LIST_ITEM
      [
        /^(\s*)([-*+])\s+([^\n]*)(?:\n|$)/,
        "LISTITEM",
        (match: string, indent: string) => ({
          value: match.replace(/^\s*[-*+]\s+/, "").trim(),
          indent: indent.length,
        }),
      ],
    ];

    while (pos < input.length) {
      let matched = false;

      for (const [regexp, type, handler] of blockPatterns) {
        regexp.lastIndex = 0;
        const match = regexp.exec(input.slice(pos));
        if (match && match.index === 0) {
          const tokenData = handler ? handler(...match) : { value: match[0] };
          tokens.push({
            type,
            ...tokenData,
          } as Token);

          pos += match[0].length;
          matched = true;
          break;
        }
      }

      if (!matched) {
        pos++;
      }
    }
    return tokens;
  }
}

Then I will create a class Tokenizer. Inside this class, I will define a tokenize method that accepts an input, which is a string, and returns an array of tokens. Inside the method, I will create a variable tokens, which will be an array of tokens. All identified tokens will be stored inside this array. I will also create a variable pos, which will be equal to 0. This will serve as a cursor that moves over the text as we identify tokens.

I will create a variable blockPatterns, which will be an array of tuples. The first item in each tuple will be a regular expression, the second item will be the type for the token, and the third item will be an optional handler function. The array will start out empty. The first tuple I add will match a markdown heading: its regular expression matches one to six # characters followed by a space and the rest of the line; its type is HEADING; and its handler accepts the matched string and returns an object with a value property holding the heading text (with the leading # characters stripped) and a level property, which counts the number of # characters for that particular heading.
We will add another tuple to the parent array, and this will allow us to identify a newline token in the string. For this walkthrough we will focus on just the heading and the newline, even though the full blockPatterns array above also handles text, links, images, and list items. I will now add a while loop that will run as long as we have not reached the end of the input string. Then I will create a variable matched, which will start out as false.

Then I will loop through the blockPatterns array, extracting the regex, type, and handler function from each tuple. I will set the lastIndex property on the regex to zero, and then attempt to extract a token from the text by creating a variable match, which stores the result of calling the exec method on the regexp; we make sure to slice the input starting from the position of the cursor.

Then I will check that there is a match and that match.index is equal to zero. If this is the case, I will create a tokenData variable whose value will be conditional: if the handler function exists, I will call it with the spread result of the match; if it does not, then tokenData will be an object with a value property set to the first item in the match array, which holds the full text matched against the regular expression.

Then I will push the token into the tokens array, increment pos (the cursor) by the length of the matched text, set matched to true, and break out of the inner for loop. After the for loop, we simply increment the cursor by one if there was no match, and after the while loop, I will return the tokens array.
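
To make the handler(...match) call concrete, here is what exec returns for the link pattern from blockPatterns and how those values reach the handler positionally:

const linkPattern = /^\[([^\]]+)\]\(([^)]+)\)/;
const match = linkPattern.exec("[DEV](https://dev.to) is a community");

if (match) {
  // match[0] is the full matched text, match[1] and match[2] are the capture groups
  const [full, text, url] = match;
  console.log(full); // "[DEV](https://dev.to)"
  console.log({ value: text, url }); // { value: "DEV", url: "https://dev.to" }, i.e., the token data the LINK handler produces
}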

This tokenizer is responsible for block-level tokenization. The method (tokenize) focuses on identifying larger, structural pieces of your text, like entire headings, links, or images. It processes these "blocks" line by line or by significant patterns. It uses a list of predefined rules, each containing a regular expression (regex). When it finds a match, it grabs that chunk of text, determines its type (e.g., "HEADING", "LINK"), and extracts any specific information related to it (like the heading level or the link's URL). It then moves past that matched chunk and continues searching in the rest of the text.

const tokenizer = new Tokenizer();
console.log(tokenizer.tokenize(`
## Hello World
`));

// [
//   { type: "NEWLINE", value: "\n" },
//   { type: "HEADING", value: "Hello World", level: 2 }
// ]

So let's test out this class by creating a variable tokenizer, which stores the result of instantiating the Tokenizer class. Then I will tokenize a basic markdown string containing just a level-two heading and log the resulting tokens array to the console.

You can observe that there are two tokens in the array, one corresponding to a new line, while the second is a heading where the level property is two. The next thing would be to tokenize inline characters. We need to split the value obtained from a heading into individual tokens, and this is the job of the tokenizeInline method. This method will not be called directly; rather, it will be called by the parser.

class Tokenizer {
  tokenize(input: string){
    // cont'd
  }

  tokenizeInline(input: string): Token[] {
    let tokens: Token[] = [];
    let cursor = 0;

    const inlinePatterns: [RegExp, TokenType, Function?][] = [
      [/\s+/, "SPACE"],
      [/[^\s*_!\[\n]+/, "TEXT"],
      // Links
      [
        /^\[([^\]]+)\]\(([^)]+)\)/,
        "LINK",
        (match: string, text: string, url: string) => ({
          value: text,
          url,
        }),
      ],
      // images
      [
        /^!\[([^\]]+)\]\(([^)]+)\)/,
        "IMAGE",
        (match: string, alt: string, url: string) => ({
          alt,
          url,
          value: match,
        }),
      ],

      // Bold
      [
        /(\*\*|__)([^\n]+?)\1/,
        "BOLD",
        (match: string, _: any, content: string) => ({
          value: content,
        }),
      ],
      // Italic
      [
        /(\*|_)([^\n]+?)\1/,
        "ITALIC",
        (match: string, _: any, content: string) => ({
          value: content,
        }),
      ],
    ];

    while (cursor < input.length) {
      let matched = false;

      for (const [regexp, type, handler] of inlinePatterns) {
        regexp.lastIndex = 0;
        const match = regexp.exec(input.slice(cursor));

        if (match && match.index === 0) {
          const tokenData = handler ? handler(...match) : { value: match[0] };

          tokens.push({
            type,
            ...tokenData,
          } as Token);

          cursor += match[0].length;
          matched = true;
          break;
        }
      }

      if (!matched) {
        tokens.push({ type: "TEXT", value: input[cursor] });
        cursor++;
      }
    }
    return tokens;
  }
}

Now let's implement the tokenizeInline method, which will accept an input argument, which is a string, and it will return an array of tokens. Inside this method, I will create a variable tokens, which is an array of tokens; this variable will store tokens extracted from the input. I will create another
variable cursor, which will be equal to 0.

I will now create an inlinePatterns variable, which is similar to the blockPatterns variable we created when implementing the tokenize method. The inlinePatterns will be an array of tuples, where the first item in the tuple is a regular expression, the second item is the type of the token, while the third item is an optional function.

The first tuple will identify white space, while the second tuple will identify and extract text tokens. After the inlinePatterns array, I will add a while loop that will run until we have consumed all of the input. Inside this while loop, I will first create a variable matched that tracks whether or not a token was matched, and then create a for loop that goes over the inlinePatterns array, extracting the regular expression, the type, and the handler function.

Inside the for loop, I will set regexp.lastIndex to 0, and then try to identify a token. This will be done by creating a variable match, which is equal to the result of calling the exec method on the regexp. I will pass the input string as an argument to it, but I will slice the input string from the cursor to ensure that we don't re-tokenize text we have already consumed.

If there's a match and the index of the match is equal to zero, then I will create a variable tokenData, which will be equal to the result of calling the handler function with the match, if there's a handler function, otherwise I will just return an object with a value property whose value is the first item in the match array.
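
For instance, the bold pattern uses a backreference (\1) so the closing delimiter has to match the opening one. Here is that handler in isolation, fed the values exec produces for a bold span; it uses the same regex and handler logic as the BOLD entry in inlinePatterns:

const boldPattern = /(\*\*|__)([^\n]+?)\1/;
const boldHandler = (_match: string, _delimiter: string, content: string) => ({
  value: content,
});

const boldMatch = boldPattern.exec("**World** and more");
// boldMatch[0] = "**World**", boldMatch[1] = "**" (the delimiter), boldMatch[2] = "World"
if (boldMatch) {
  console.log(boldHandler(boldMatch[0], boldMatch[1], boldMatch[2]));
  // { value: "World" }
}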

I will push a new token into the tokens array: the type for that token will be the matched type, and I will spread the tokenData into the rest of the token. Then I will increment the cursor by the length of the matched value, which is stored at the first index of the match array, set matched to true, and break out of the inner for loop.

If matched is false, then we want to push a text token into the tokens array and increment the cursor by one. I will exit the while loop and ensure that I return the identified tokens at the end of the function.
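
Although this method will normally be called by the parser, we can invoke it directly as a quick check, reusing the tokenizer instance we created earlier:

console.log(tokenizer.tokenizeInline("Hello **World**"));

// [
//   { type: "TEXT", value: "Hello" },
//   { type: "SPACE", value: " " },
//   { type: "BOLD", value: "World" }
// ]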
