DEV Community

Cover image for Abstract Syntax Tree In TypeScript
bilel salem
bilel salem

Posted on

Abstract Syntax Tree In TypeScript

Hello everyone, السلام عليكم و رحمة الله و بركاته

Abstract Syntax Trees (ASTs) are a fundamental concept in computer science, particularly in the realms of programming languages and compilers. While ASTs are not exclusive to TypeScript, they are pervasive across various programming languages and play a crucial role in many aspects of software development.

Importance of ASTs

  1. Compiler Infrastructure: ASTs serve as an intermediary representation of code between its textual form and machine-executable instructions. Compilers and interpreters use ASTs to analyze, transform, and generate code during the compilation process.

  2. Static Analysis: ASTs enable static analysis tools to inspect code for potential errors, security vulnerabilities, or code smells without executing it. Tools like linters, code formatters, and static analyzers leverage ASTs to provide insights into code quality and maintainability.

  3. Language Features: ASTs facilitate the implementation of advanced language features such as type inference, pattern matching, and syntactic sugar. By manipulating ASTs, language designers can introduce new constructs and behaviors into programming languages.

ASTs in TypeScript

While ASTs are not specific to TypeScript, they are integral to the TypeScript compiler's operation and ecosystem. TypeScript's compiler (tsc) parses TypeScript code into an AST representation before type-checking, emitting JavaScript, or performing other compilation tasks. ASTs in TypeScript capture the syntactic and semantic structure of TypeScript code, including type annotations, generics, and other TypeScript-specific features.

How ASTs are Used in TypeScript

  1. Type Checking: TypeScript's type checker traverses the AST to perform type inference, type checking, and type resolution. AST nodes corresponding to variable declarations, function calls, and type annotations are analyzed to ensure type safety and correctness.

  2. Code Transformation: TypeScript's compiler API allows developers to programmatically manipulate ASTs to transform TypeScript code. Custom transformers can be applied to modify AST nodes, enabling tasks such as code optimization, polyfilling, or code generation.

  3. Tooling Support: IDEs and code editors leverage ASTs to provide rich language services for TypeScript developers. Features like code completion, refactoring, and error highlighting rely on ASTs to understand code context and provide accurate feedback to developers.

  4. TypeScript Ecosystem: Various tools and libraries within the TypeScript ecosystem utilize ASTs to enhance developer productivity and enable advanced tooling capabilities. For example, tools like ts-migrate and tslint rely on ASTs to automate code migrations and enforce coding standards.

Example

Suppose we have the following TypeScript code:

function greet(name: string): void {
  console.log("Hello, " + name + "!");
}

greet("Mohamed");
Enter fullscreen mode Exit fullscreen mode

We can use TypeScript's Compiler API to parse this code into an AST and traverse the tree to inspect its structure.

Here's how you can do it programmatically:

import * as ts from "typescript";

// TypeScript code to parse
const code = `
  function greet(name: string): void {
    console.log("Hello, " + name + "!");
  }

  greet("Mohamed");
`;

// Parse the TypeScript code into an AST
const sourceFile = ts.createSourceFile(
  "example.ts",
  code,
  ts.ScriptTarget.Latest
);

// Recursive function to traverse the AST
function traverse(node: ts.Node, depth = 0) {
  console.log(
    `${" ".repeat(depth * 2)}${ts.SyntaxKind[node.kind]} - ${node.getText()}`
  );
  ts.forEachChild(node, (childNode) => traverse(childNode, depth + 1));
}

// Start traversing the AST from the source file
traverse(sourceFile);
Enter fullscreen mode Exit fullscreen mode

When you run this code, it will output the AST structure of the TypeScript code:

SourceFile - 
  FunctionDeclaration - function greet(name: string): void {
    console.log("Hello, " + name + "!");
  }
    Identifier - greet
    Parameter - name: string
      Identifier - name
      StringKeyword - string
    VoidKeyword - void
    Block - {
      ExpressionStatement - console.log("Hello, " + name + "!");
        CallExpression - console.log("Hello, " + name + "!")
          PropertyAccessExpression - console.log
            Identifier - console
            Identifier - log
          StringLiteral - "Hello, "
          BinaryExpression - name + "!"
            Identifier - name
            StringLiteral - "!"
  ExpressionStatement - greet("Mohamed")
    CallExpression - greet("Mohamed")
      Identifier - greet
      StringLiteral - "Mohamed"
Enter fullscreen mode Exit fullscreen mode

In this output:

  • Each line represents a node in the AST.
  • The indentation indicates the parent-child relationship between nodes.
  • The text after the node type (e.g., FunctionDeclaration, Identifier) represents the actual TypeScript code corresponding to that node.

This example demonstrates how TypeScript's Compiler API can be used to parse TypeScript code into an AST and traverse the tree to inspect its structure programmatically.

Conclusion

In summary, Abstract Syntax Trees (ASTs) are a fundamental concept in programming language theory and compiler construction. While not specific to TypeScript, ASTs play a crucial role in the TypeScript ecosystem, enabling type checking, code transformation, and advanced tooling capabilities. Understanding ASTs is essential for developers seeking to leverage TypeScript's features effectively and contribute to the TypeScript ecosystem's growth and evolution.

Top comments (7)

Collapse
 
syeo66 profile image
Red Ochsenbein (he/him)

So, you're simply using ChatGPT for creating you're blog post AND the comments and not even mention it, as you should by dev.to guidelInes?

Collapse
 
efpage profile image
Eckehard

You are my hero! I´m trying to build a markdown parser from scratch. Ok, there are many implementations, but non of them fits my needs. I know that I should use a syntax tree, but I´m not sure about the implementation. By the way, I want to use the same markdown syntax most parsers use (e.g. this), but some elements are tricky to capture:

  • Ordered Lists do not have a fixed delimiter, any number followed by a dot can be used
  • A square bracket can be a link, if it is followed by a round bracket, other wise it´s just a square bracket. Any hints how to get the right approach? For some reason I do not want to use a parser generator, so I´m building anything from scratch.
Collapse
 
bilelsalemdev profile image
bilel salem

Ordered Lists

Ordered lists in Markdown can have any number followed by a dot (1., 2., etc.).

Tokenization Strategy:

  • Look for lines starting with a pattern \d+\. (one or more digits followed by a dot).
  • Capture the number and the list item content.

Example:

1. First item
2. Second item
10. Tenth item
Enter fullscreen mode Exit fullscreen mode

Parsing Strategy:

  • Create a ListNode for the ordered list.
  • Create ListItemNode children for each list item.

Links

Links in Markdown have the format [text](url).

Tokenization Strategy:

  • Look for patterns matching \[.*?\]\(.*?\).
  • Capture the link text and URL separately.

Example:

[OpenAI](https://www.openai.com)
Enter fullscreen mode Exit fullscreen mode

Parsing Strategy:

  • Create a LinkNode with text and url attributes.

Example Implementation in Python

Here is a simplified example of how you might start implementing this in Python:

import re

class TokenType:
    TEXT = 'TEXT'
    HEADING = 'HEADING'
    ORDERED_LIST_ITEM = 'ORDERED_LIST_ITEM'
    LINK = 'LINK'

class Token:
    def __init__(self, type, value):
        self.type = type
        self.value = value

def tokenize(markdown):
    tokens = []
    lines = markdown.split('\n')

    for line in lines:
        if re.match(r'^\d+\.\s', line):
            tokens.append(Token(TokenType.ORDERED_LIST_ITEM, line))
        elif re.match(r'^\[.*?\]\(.*?\)$', line):
            tokens.append(Token(TokenType.LINK, line))
        else:
            tokens.append(Token(TokenType.TEXT, line))

    return tokens

class NodeType:
    DOCUMENT = 'DOCUMENT'
    PARAGRAPH = 'PARAGRAPH'
    ORDERED_LIST = 'ORDERED_LIST'
    LIST_ITEM = 'LIST_ITEM'
    LINK = 'LINK'

class Node:
    def __init__(self, type, value=None):
        self.type = type
        self.value = value
        self.children = []

def parse(tokens):
    root = Node(NodeType.DOCUMENT)
    current_node = root

    for token in tokens:
        if token.type == TokenType.ORDERED_LIST_ITEM:
            list_item_node = Node(NodeType.LIST_ITEM, token.value)
            if current_node.type != NodeType.ORDERED_LIST:
                current_node = Node(NodeType.ORDERED_LIST)
                root.children.append(current_node)
            current_node.children.append(list_item_node)
        elif token.type == TokenType.LINK:
            match = re.match(r'^\[(.*?)\]\((.*?)\)$', token.value)
            link_node = Node(NodeType.LINK, {'text': match.group(1), 'url': match.group(2)})
            root.children.append(link_node)
        else:
            paragraph_node = Node(NodeType.PARAGRAPH, token.value)
            root.children.append(paragraph_node)

    return root

def render_html(node):
    if node.type == NodeType.DOCUMENT:
        return ''.join(render_html(child) for child in node.children)
    elif node.type == NodeType.PARAGRAPH:
        return f'<p>{node.value}</p>'
    elif node.type == NodeType.ORDERED_LIST:
        return f'<ol>{" ".join(render_html(child) for child in node.children)}</ol>'
    elif node.type == NodeType.LIST_ITEM:
        return f'<li>{node.value}</li>'
    elif node.type == NodeType.LINK:
        return f'<a href="{node.value["url"]}">{node.value["text"]}</a>'

# Example usage
markdown_text = """1. First item
2. Second item
[OpenAI](https://www.openai.com)"""

tokens = tokenize(markdown_text)
ast = parse(tokens)
html = render_html(ast)

print(html)
Enter fullscreen mode Exit fullscreen mode

Explanation

  1. Tokenize: The tokenize function processes each line of the Markdown input and generates tokens based on patterns.
  2. Parse: The parse function converts tokens into an AST. It recognizes ordered list items and links, nesting them appropriately.
  3. Render: The render_html function traverses the AST to generate HTML output.

This is a basic framework. You can extend it by adding more sophisticated handling for other Markdown features like unordered lists, blockquotes, code blocks, etc. Additionally, refining the tokenization and parsing logic will help in accurately capturing and rendering all Markdown syntax.

Collapse
 
efpage profile image
Eckehard • Edited

Thank you so much for your answer!

Please correct me, I´m not used to Python and not very good at reading RegEx, but If I understand it right, you might get trouble with nested brackets like this (?)

[[Link-Text]](URL.com)

Thread Thread
 
bilelsalemdev profile image
bilel salem

You're correct. The regular expression I provided for links doesn't handle nested brackets properly. Nested brackets can indeed cause issues because the regex pattern \[.*?\]\(.*?\) will greedily match the first closing bracket, leading to incorrect tokenization.

To handle nested brackets correctly, we need a more sophisticated approach. One way is to use a stack to track the brackets during the tokenization phase. This approach can manage nested brackets by ensuring each opening bracket has a corresponding closing bracket.

Here is the code example in typescript :

enum TokenType {
    TEXT = 'TEXT',
    HEADING = 'HEADING',
    ORDERED_LIST_ITEM = 'ORDERED_LIST_ITEM',
    LINK = 'LINK'
}

class Token {
    type: TokenType;
    value: string;

    constructor(type: TokenType, value: string) {
        this.type = type;
        this.value = value;
    }
}

function tokenize(markdown: string): Token[] {
    const tokens: Token[] = [];
    const lines = markdown.split('\n');

    for (const line of lines) {
        if (/^\d+\.\s/.test(line)) {
            tokens.push(new Token(TokenType.ORDERED_LIST_ITEM, line));
        } else if (/\[.*?\]\(.*?\)/.test(line)) {
            tokens.push(new Token(TokenType.LINK, line));
        } else {
            tokens.push(new Token(TokenType.TEXT, line));
        }
    }

    return tokens;
}

enum NodeType {
    DOCUMENT = 'DOCUMENT',
    PARAGRAPH = 'PARAGRAPH',
    ORDERED_LIST = 'ORDERED_LIST',
    LIST_ITEM = 'LIST_ITEM',
    LINK = 'LINK'
}

class Node {
    type: NodeType;
    value: any;
    children: Node[];

    constructor(type: NodeType, value: any = null) {
        this.type = type;
        this.value = value;
        this.children = [];
    }
}

function parse(tokens: Token[]): Node {
    const root = new Node(NodeType.DOCUMENT);
    let currentNode = root;

    for (const token of tokens) {
        if (token.type === TokenType.ORDERED_LIST_ITEM) {
            const listItemNode = new Node(NodeType.LIST_ITEM, token.value);
            if (currentNode.type !== NodeType.ORDERED_LIST) {
                currentNode = new Node(NodeType.ORDERED_LIST);
                root.children.push(currentNode);
            }
            currentNode.children.push(listItemNode);
        } else if (token.type === TokenType.LINK) {
            const linkNode = parseLink(token.value);
            root.children.push(linkNode);
        } else {
            const paragraphNode = new Node(NodeType.PARAGRAPH, token.value);
            root.children.push(paragraphNode);
        }
    }

    return root;
}

function parseLink(text: string): Node {
    const stack: string[] = [];
    let linkText = '';
    let url = '';
    let i = 0;
    let insideLinkText = false;

    while (i < text.length) {
        const char = text[i];
        if (char === '[') {
            stack.push(char);
            if (!insideLinkText) {
                insideLinkText = true;
                i++;
                continue;
            }
        } else if (char === ']' && stack[stack.length - 1] === '[') {
            stack.pop();
            if (stack.length === 0) {
                insideLinkText = false;
                i += 2;  // Skip "]("
                continue;
            }
        } else if (char === '(' && !insideLinkText) {
            stack.push(char);
        } else if (char === ')' && stack[stack.length - 1] === '(') {
            stack.pop();
            if (stack.length === 0) {
                break;
            }
        }

        if (insideLinkText) {
            linkText += char;
        } else {
            url += char;
        }

        i++;
    }

    return new Node(NodeType.LINK, { text: linkText, url: url });
}

function renderHTML(node: Node): string {
    switch (node.type) {
        case NodeType.DOCUMENT:
            return node.children.map(renderHTML).join('');
        case NodeType.PARAGRAPH:
            return `<p>${node.value}</p>`;
        case NodeType.ORDERED_LIST:
            return `<ol>${node.children.map(renderHTML).join('')}</ol>`;
        case NodeType.LIST_ITEM:
            return `<li>${node.value}</li>`;
        case NodeType.LINK:
            return `<a href="${node.value.url}">${node.value.text}</a>`;
        default:
            return '';
    }
}

// Example usage
const markdownText = `1. First item
2. Second item
[[Link-Text]](https://www.example.com)`;

const tokens = tokenize(markdownText);
const ast = parse(tokens);
const html = renderHTML(ast);

console.log(html);
Enter fullscreen mode Exit fullscreen mode

The parseLink function uses a stack to handle nested brackets, ensuring that link text and URLs are correctly identified even when nested brackets are present.

Collapse
 
zeedu_dev profile image
Eduardo Zepeda

Was this post created with ChatGPT?

Collapse
 
syeo66 profile image
Red Ochsenbein (he/him)

Yes, it was.