DEV Community

Abstract Syntax Tree In TypeScript

bilel salem on June 05, 2024

Hello everyone, السلام عليكم و رحمة الله و بركاته Abstract Syntax Trees (ASTs) are a fundamental concept in computer science, particularly in the ...

Read full post

Red Ochsenbein (he/him) • Jun 6 '24

So, you're simply using ChatGPT for creating you're blog post AND the comments and not even mention it, as you should by dev.to guidelInes?

Younes Habbal • Jul 31 '25

At least he gave you information, bro. take it or shut up.

Red Ochsenbein (he/him) • Aug 1 '25

Did he? I don't see any value.

Eckehard • Jun 5 '24

You are my hero! I´m trying to build a markdown parser from scratch. Ok, there are many implementations, but non of them fits my needs. I know that I should use a syntax tree, but I´m not sure about the implementation. By the way, I want to use the same markdown syntax most parsers use (e.g. this), but some elements are tricky to capture:

Ordered Lists do not have a fixed delimiter, any number followed by a dot can be used
A square bracket can be a link, if it is followed by a round bracket, other wise it´s just a square bracket. Any hints how to get the right approach? For some reason I do not want to use a parser generator, so I´m building anything from scratch.

bilel salem • Jun 5 '24

Ordered Lists

Ordered lists in Markdown can have any number followed by a dot (1., 2., etc.).

Tokenization Strategy:

Look for lines starting with a pattern \d+\. (one or more digits followed by a dot).
Capture the number and the list item content.

Example:

1. First item
2. Second item
10. Tenth item

Parsing Strategy:

Create a ListNode for the ordered list.
Create ListItemNode children for each list item.

Links

Links in Markdown have the format [text](url).

Tokenization Strategy:

Look for patterns matching \[.*?\]\(.*?\).
Capture the link text and URL separately.

Example:

[OpenAI](https://www.openai.com)

Parsing Strategy:

Create a LinkNode with text and url attributes.

Example Implementation in Python

Here is a simplified example of how you might start implementing this in Python:

import re

class TokenType:
    TEXT = 'TEXT'
    HEADING = 'HEADING'
    ORDERED_LIST_ITEM = 'ORDERED_LIST_ITEM'
    LINK = 'LINK'

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

def tokenize(markdown):
    tokens = []
    lines = markdown.split('\n')

    for line in lines:
        if re.match(r'^\d+\.\s', line):
            tokens.append(Token(TokenType.ORDERED_LIST_ITEM, line))
        elif re.match(r'^\[.*?\]\(.*?\)$', line):
            tokens.append(Token(TokenType.LINK, line))
        else:
            tokens.append(Token(TokenType.TEXT, line))

    return tokens

class NodeType:
    DOCUMENT = 'DOCUMENT'
    PARAGRAPH = 'PARAGRAPH'
    ORDERED_LIST = 'ORDERED_LIST'
    LIST_ITEM = 'LIST_ITEM'
    LINK = 'LINK'

class Node:
    def __init__(self, type, value=None):
        self.type = type
        self.value = value
        self.children = []

def parse(tokens):
    root = Node(NodeType.DOCUMENT)
    current_node = root

    for token in tokens:
        if token.type == TokenType.ORDERED_LIST_ITEM:
            list_item_node = Node(NodeType.LIST_ITEM, token.value)
            if current_node.type != NodeType.ORDERED_LIST:
                current_node = Node(NodeType.ORDERED_LIST)
                root.children.append(current_node)
            current_node.children.append(list_item_node)
        elif token.type == TokenType.LINK:
            match = re.match(r'^\[(.*?)\]\((.*?)\)$', token.value)
            link_node = Node(NodeType.LINK, {'text': match.group(1), 'url': match.group(2)})
            root.children.append(link_node)
        else:
            paragraph_node = Node(NodeType.PARAGRAPH, token.value)
            root.children.append(paragraph_node)

    return root

def render_html(node):
    if node.type == NodeType.DOCUMENT:
        return ''.join(render_html(child) for child in node.children)
    elif node.type == NodeType.PARAGRAPH:
        return f'<p>{node.value}</p>'
    elif node.type == NodeType.ORDERED_LIST:
        return f'<ol>{" ".join(render_html(child) for child in node.children)}</ol>'
    elif node.type == NodeType.LIST_ITEM:
        return f'<li>{node.value}</li>'
    elif node.type == NodeType.LINK:
        return f'<a href="{node.value["url"]}">{node.value["text"]}</a>'

# Example usage
markdown_text = """1. First item
2. Second item
[OpenAI](https://www.openai.com)"""

tokens = tokenize(markdown_text)
ast = parse(tokens)
html = render_html(ast)

print(html)

Explanation

Tokenize: The tokenize function processes each line of the Markdown input and generates tokens based on patterns.
Parse: The parse function converts tokens into an AST. It recognizes ordered list items and links, nesting them appropriately.
Render: The render_html function traverses the AST to generate HTML output.

This is a basic framework. You can extend it by adding more sophisticated handling for other Markdown features like unordered lists, blockquotes, code blocks, etc. Additionally, refining the tokenization and parsing logic will help in accurately capturing and rendering all Markdown syntax.

Eckehard • Jun 5 '24 • Edited

Thank you so much for your answer!

Please correct me, I´m not used to Python and not very good at reading RegEx, but If I understand it right, you might get trouble with nested brackets like this (?)

[[Link-Text]](URL.com)

bilel salem • Jun 5 '24

You're correct. The regular expression I provided for links doesn't handle nested brackets properly. Nested brackets can indeed cause issues because the regex pattern \[.*?\]\(.*?\) will greedily match the first closing bracket, leading to incorrect tokenization.

To handle nested brackets correctly, we need a more sophisticated approach. One way is to use a stack to track the brackets during the tokenization phase. This approach can manage nested brackets by ensuring each opening bracket has a corresponding closing bracket.

Here is the code example in typescript :

enum TokenType {
    TEXT = 'TEXT',
    HEADING = 'HEADING',
    ORDERED_LIST_ITEM = 'ORDERED_LIST_ITEM',
    LINK = 'LINK'
}

class Token {
    type: TokenType;
    value: string;

    constructor(type: TokenType, value: string) {
        this.type = type;
        this.value = value;
    }
}

function tokenize(markdown: string): Token[] {
    const tokens: Token[] = [];
    const lines = markdown.split('\n');

    for (const line of lines) {
        if (/^\d+\.\s/.test(line)) {
            tokens.push(new Token(TokenType.ORDERED_LIST_ITEM, line));
        } else if (/\[.*?\]\(.*?\)/.test(line)) {
            tokens.push(new Token(TokenType.LINK, line));
        } else {
            tokens.push(new Token(TokenType.TEXT, line));
        }
    }

    return tokens;
}

enum NodeType {
    DOCUMENT = 'DOCUMENT',
    PARAGRAPH = 'PARAGRAPH',
    ORDERED_LIST = 'ORDERED_LIST',
    LIST_ITEM = 'LIST_ITEM',
    LINK = 'LINK'
}

class Node {
    type: NodeType;
    value: any;
    children: Node[];

    constructor(type: NodeType, value: any = null) {
        this.type = type;
        this.value = value;
        this.children = [];
    }
}

function parse(tokens: Token[]): Node {
    const root = new Node(NodeType.DOCUMENT);
    let currentNode = root;

    for (const token of tokens) {
        if (token.type === TokenType.ORDERED_LIST_ITEM) {
            const listItemNode = new Node(NodeType.LIST_ITEM, token.value);
            if (currentNode.type !== NodeType.ORDERED_LIST) {
                currentNode = new Node(NodeType.ORDERED_LIST);
                root.children.push(currentNode);
            }
            currentNode.children.push(listItemNode);
        } else if (token.type === TokenType.LINK) {
            const linkNode = parseLink(token.value);
            root.children.push(linkNode);
        } else {
            const paragraphNode = new Node(NodeType.PARAGRAPH, token.value);
            root.children.push(paragraphNode);
        }
    }

    return root;
}

function parseLink(text: string): Node {
    const stack: string[] = [];
    let linkText = '';
    let url = '';
    let i = 0;
    let insideLinkText = false;

    while (i < text.length) {
        const char = text[i];
        if (char === '[') {
            stack.push(char);
            if (!insideLinkText) {
                insideLinkText = true;
                i++;
                continue;
            }
        } else if (char === ']' && stack[stack.length - 1] === '[') {
            stack.pop();
            if (stack.length === 0) {
                insideLinkText = false;
                i += 2;  // Skip "]("
                continue;
            }
        } else if (char === '(' && !insideLinkText) {
            stack.push(char);
        } else if (char === ')' && stack[stack.length - 1] === '(') {
            stack.pop();
            if (stack.length === 0) {
                break;
            }
        }

        if (insideLinkText) {
            linkText += char;
        } else {
            url += char;
        }

        i++;
    }

    return new Node(NodeType.LINK, { text: linkText, url: url });
}

function renderHTML(node: Node): string {
    switch (node.type) {
        case NodeType.DOCUMENT:
            return node.children.map(renderHTML).join('');
        case NodeType.PARAGRAPH:
            return `<p>${node.value}</p>`;
        case NodeType.ORDERED_LIST:
            return `<ol>${node.children.map(renderHTML).join('')}</ol>`;
        case NodeType.LIST_ITEM:
            return `<li>${node.value}</li>`;
        case NodeType.LINK:
            return `<a href="${node.value.url}">${node.value.text}</a>`;
        default:
            return '';
    }
}

// Example usage
const markdownText = `1. First item
2. Second item
[[Link-Text]](https://www.example.com)`;

const tokens = tokenize(markdownText);
const ast = parse(tokens);
const html = renderHTML(ast);

console.log(html);

The parseLink function uses a stack to handle nested brackets, ensuring that link text and URLs are correctly identified even when nested brackets are present.