DEV Community

Cover image for Abstract Syntax Tree In TypeScript

Abstract Syntax Tree In TypeScript

bilel salem on June 05, 2024

Hello everyone, السلام عليكم و رحمة الله و بركاته Abstract Syntax Trees (ASTs) are a fundamental concept in computer science, particularly in the ...
Collapse
 
syeo66 profile image
Red Ochsenbein (he/him)

So, you're simply using ChatGPT for creating you're blog post AND the comments and not even mention it, as you should by dev.to guidelInes?

Collapse
 
nazguul profile image
Younes Habbal

At least he gave you information, bro. take it or shut up.

Collapse
 
syeo66 profile image
Red Ochsenbein (he/him)

Did he? I don't see any value.

Collapse
 
efpage profile image
Eckehard

You are my hero! I´m trying to build a markdown parser from scratch. Ok, there are many implementations, but non of them fits my needs. I know that I should use a syntax tree, but I´m not sure about the implementation. By the way, I want to use the same markdown syntax most parsers use (e.g. this), but some elements are tricky to capture:

  • Ordered Lists do not have a fixed delimiter, any number followed by a dot can be used
  • A square bracket can be a link, if it is followed by a round bracket, other wise it´s just a square bracket. Any hints how to get the right approach? For some reason I do not want to use a parser generator, so I´m building anything from scratch.
Collapse
 
bilelsalemdev profile image
bilel salem

Ordered Lists

Ordered lists in Markdown can have any number followed by a dot (1., 2., etc.).

Tokenization Strategy:

  • Look for lines starting with a pattern \d+\. (one or more digits followed by a dot).
  • Capture the number and the list item content.

Example:

1. First item
2. Second item
10. Tenth item
Enter fullscreen mode Exit fullscreen mode

Parsing Strategy:

  • Create a ListNode for the ordered list.
  • Create ListItemNode children for each list item.

Links

Links in Markdown have the format [text](url).

Tokenization Strategy:

  • Look for patterns matching \[.*?\]\(.*?\).
  • Capture the link text and URL separately.

Example:

[OpenAI](https://www.openai.com)
Enter fullscreen mode Exit fullscreen mode

Parsing Strategy:

  • Create a LinkNode with text and url attributes.

Example Implementation in Python

Here is a simplified example of how you might start implementing this in Python:

import re

class TokenType:
    TEXT = 'TEXT'
    HEADING = 'HEADING'
    ORDERED_LIST_ITEM = 'ORDERED_LIST_ITEM'
    LINK = 'LINK'

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

def tokenize(markdown):
    tokens = []
    lines = markdown.split('\n')

    for line in lines:
        if re.match(r'^\d+\.\s', line):
            tokens.append(Token(TokenType.ORDERED_LIST_ITEM, line))
        elif re.match(r'^\[.*?\]\(.*?\)$', line):
            tokens.append(Token(TokenType.LINK, line))
        else:
            tokens.append(Token(TokenType.TEXT, line))

    return tokens

class NodeType:
    DOCUMENT = 'DOCUMENT'
    PARAGRAPH = 'PARAGRAPH'
    ORDERED_LIST = 'ORDERED_LIST'
    LIST_ITEM = 'LIST_ITEM'
    LINK = 'LINK'

class Node:
    def __init__(self, type, value=None):
        self.type = type
        self.value = value
        self.children = []

def parse(tokens):
    root = Node(NodeType.DOCUMENT)
    current_node = root

    for token in tokens:
        if token.type == TokenType.ORDERED_LIST_ITEM:
            list_item_node = Node(NodeType.LIST_ITEM, token.value)
            if current_node.type != NodeType.ORDERED_LIST:
                current_node = Node(NodeType.ORDERED_LIST)
                root.children.append(current_node)
            current_node.children.append(list_item_node)
        elif token.type == TokenType.LINK:
            match = re.match(r'^\[(.*?)\]\((.*?)\)$', token.value)
            link_node = Node(NodeType.LINK, {'text': match.group(1), 'url': match.group(2)})
            root.children.append(link_node)
        else:
            paragraph_node = Node(NodeType.PARAGRAPH, token.value)
            root.children.append(paragraph_node)

    return root

def render_html(node):
    if node.type == NodeType.DOCUMENT:
        return ''.join(render_html(child) for child in node.children)
    elif node.type == NodeType.PARAGRAPH:
        return f'<p>{node.value}</p>'
    elif node.type == NodeType.ORDERED_LIST:
        return f'<ol>{" ".join(render_html(child) for child in node.children)}</ol>'
    elif node.type == NodeType.LIST_ITEM:
        return f'<li>{node.value}</li>'
    elif node.type == NodeType.LINK:
        return f'<a href="{node.value["url"]}">{node.value["text"]}</a>'

# Example usage
markdown_text = """1. First item
2. Second item
[OpenAI](https://www.openai.com)"""

tokens = tokenize(markdown_text)
ast = parse(tokens)
html = render_html(ast)

print(html)
Enter fullscreen mode Exit fullscreen mode

Explanation

  1. Tokenize: The tokenize function processes each line of the Markdown input and generates tokens based on patterns.
  2. Parse: The parse function converts tokens into an AST. It recognizes ordered list items and links, nesting them appropriately.
  3. Render: The render_html function traverses the AST to generate HTML output.

This is a basic framework. You can extend it by adding more sophisticated handling for other Markdown features like unordered lists, blockquotes, code blocks, etc. Additionally, refining the tokenization and parsing logic will help in accurately capturing and rendering all Markdown syntax.

Collapse
 
efpage profile image
Eckehard • Edited

Thank you so much for your answer!

Please correct me, I´m not used to Python and not very good at reading RegEx, but If I understand it right, you might get trouble with nested brackets like this (?)

[[Link-Text]](URL.com)

Thread Thread
 
bilelsalemdev profile image
bilel salem

You're correct. The regular expression I provided for links doesn't handle nested brackets properly. Nested brackets can indeed cause issues because the regex pattern \[.*?\]\(.*?\) will greedily match the first closing bracket, leading to incorrect tokenization.

To handle nested brackets correctly, we need a more sophisticated approach. One way is to use a stack to track the brackets during the tokenization phase. This approach can manage nested brackets by ensuring each opening bracket has a corresponding closing bracket.

Here is the code example in typescript :

enum TokenType {
    TEXT = 'TEXT',
    HEADING = 'HEADING',
    ORDERED_LIST_ITEM = 'ORDERED_LIST_ITEM',
    LINK = 'LINK'
}

class Token {
    type: TokenType;
    value: string;

    constructor(type: TokenType, value: string) {
        this.type = type;
        this.value = value;
    }
}

function tokenize(markdown: string): Token[] {
    const tokens: Token[] = [];
    const lines = markdown.split('\n');

    for (const line of lines) {
        if (/^\d+\.\s/.test(line)) {
            tokens.push(new Token(TokenType.ORDERED_LIST_ITEM, line));
        } else if (/\[.*?\]\(.*?\)/.test(line)) {
            tokens.push(new Token(TokenType.LINK, line));
        } else {
            tokens.push(new Token(TokenType.TEXT, line));
        }
    }

    return tokens;
}

enum NodeType {
    DOCUMENT = 'DOCUMENT',
    PARAGRAPH = 'PARAGRAPH',
    ORDERED_LIST = 'ORDERED_LIST',
    LIST_ITEM = 'LIST_ITEM',
    LINK = 'LINK'
}

class Node {
    type: NodeType;
    value: any;
    children: Node[];

    constructor(type: NodeType, value: any = null) {
        this.type = type;
        this.value = value;
        this.children = [];
    }
}

function parse(tokens: Token[]): Node {
    const root = new Node(NodeType.DOCUMENT);
    let currentNode = root;

    for (const token of tokens) {
        if (token.type === TokenType.ORDERED_LIST_ITEM) {
            const listItemNode = new Node(NodeType.LIST_ITEM, token.value);
            if (currentNode.type !== NodeType.ORDERED_LIST) {
                currentNode = new Node(NodeType.ORDERED_LIST);
                root.children.push(currentNode);
            }
            currentNode.children.push(listItemNode);
        } else if (token.type === TokenType.LINK) {
            const linkNode = parseLink(token.value);
            root.children.push(linkNode);
        } else {
            const paragraphNode = new Node(NodeType.PARAGRAPH, token.value);
            root.children.push(paragraphNode);
        }
    }

    return root;
}

function parseLink(text: string): Node {
    const stack: string[] = [];
    let linkText = '';
    let url = '';
    let i = 0;
    let insideLinkText = false;

    while (i < text.length) {
        const char = text[i];
        if (char === '[') {
            stack.push(char);
            if (!insideLinkText) {
                insideLinkText = true;
                i++;
                continue;
            }
        } else if (char === ']' && stack[stack.length - 1] === '[') {
            stack.pop();
            if (stack.length === 0) {
                insideLinkText = false;
                i += 2;  // Skip "]("
                continue;
            }
        } else if (char === '(' && !insideLinkText) {
            stack.push(char);
        } else if (char === ')' && stack[stack.length - 1] === '(') {
            stack.pop();
            if (stack.length === 0) {
                break;
            }
        }

        if (insideLinkText) {
            linkText += char;
        } else {
            url += char;
        }

        i++;
    }

    return new Node(NodeType.LINK, { text: linkText, url: url });
}

function renderHTML(node: Node): string {
    switch (node.type) {
        case NodeType.DOCUMENT:
            return node.children.map(renderHTML).join('');
        case NodeType.PARAGRAPH:
            return `<p>${node.value}</p>`;
        case NodeType.ORDERED_LIST:
            return `<ol>${node.children.map(renderHTML).join('')}</ol>`;
        case NodeType.LIST_ITEM:
            return `<li>${node.value}</li>`;
        case NodeType.LINK:
            return `<a href="${node.value.url}">${node.value.text}</a>`;
        default:
            return '';
    }
}

// Example usage
const markdownText = `1. First item
2. Second item
[[Link-Text]](https://www.example.com)`;

const tokens = tokenize(markdownText);
const ast = parse(tokens);
const html = renderHTML(ast);

console.log(html);
Enter fullscreen mode Exit fullscreen mode

The parseLink function uses a stack to handle nested brackets, ensuring that link text and URLs are correctly identified even when nested brackets are present.

Collapse
 
zeedu_dev profile image
Eduardo Zepeda

Was this post created with ChatGPT?

Collapse
 
syeo66 profile image
Red Ochsenbein (he/him)

Yes, it was.

Collapse
 
lois_lane_4cf8b9f4dfd6af5 profile image
Lois Lane

This code didn't work for me. I had to modify the traverse function to pass source file into node.getText().

Collapse
 
ahmed_mansour_41e01da3ab7 profile image
ahmed mansour

Thanks