DEV Community

CrabPascal
CrabPascal

Posted on

CrabPascal Parser: From Tokens to AST | Parser: de tokens à AST

Bilingual post · Post bilíngue

Jump to: English · Português


English {#english}

CrabPascal Parser: From Tokens to AST

The lexer hands the parser a stream of tokens. The parser's job is to verify that stream follows Pascal grammar and build an Abstract Syntax Tree (AST) — a tree-shaped data structure that represents program structure without irrelevant punctuation.

CrabPascal v2.22.0 defines dozens of AST node types in Rust, serialized with serde for debugging and sprint regression tests.

Grammar vs AST

Consider this snippet:

procedure Greet(name: string);
begin
  WriteLn('Hello, ', name);
end;
Enter fullscreen mode Exit fullscreen mode

The parser does not store parentheses and semicolons as tree nodes. It produces:

ProcedureDecl {
  name: "Greet",
  params: [Param { name: "name", type: String }],
  body: Block {
    statements: [
      CallStmt { target: "WriteLn", args: [StringLit("Hello, "), Ident("name")] }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Semicolons matter during parsing but disappear in the AST — exactly what later phases need.

Top-level program structure

A .dpr file parses into a Program node:

program CrudAPI;
uses Horse, System.JSON, ProdutoService;
begin
  THorse.Listen(9000);
end.
Enter fullscreen mode Exit fullscreen mode

AST shape:

Program {
  name: "CrudAPI",
  uses: ["Horse", "System.JSON", "ProdutoService"],
  declarations: [],
  block: Block { statements: [...] }
}
Enter fullscreen mode Exit fullscreen mode

Units (.pas) parse into Unit nodes with interface and implementation sections — enabling recursive loading from rtl/ and project search paths.

Expression parsing

Expressions use precedence climbing or recursive descent — standard compiler techniques:

result := (a + b) * c > 10;
Enter fullscreen mode Exit fullscreen mode

Becomes a tree respecting operator precedence (* before +, comparison after arithmetic):

Binary(GreaterThan,
  left: Binary(Multiply,
    left: Binary(Add, Ident("a"), Ident("b")),
    right: Ident("c")),
  right: Integer(10))
Enter fullscreen mode Exit fullscreen mode

Malformed expressions fail here with syntax errors before semantic analysis runs.

OOP constructs

Classes add complexity the parser must handle:

type
  TAnimal = class
  private
    FName: string;
  public
    constructor Create(AName: string);
    procedure Speak; virtual; abstract;
  end;
Enter fullscreen mode Exit fullscreen mode

The AST includes ClassDecl with fields, methods, visibility sections, and modifiers (virtual, abstract, override). Sprint releases progressively harden edge cases — parameterized class methods, nested types, generic declarations.

Error recovery and diagnostics

When parsing fails, CrabPascal reports the first coherent error with line/column:

ProdutoService.pas:28:5: error: Expected 'end' but found 'procedure'
Enter fullscreen mode Exit fullscreen mode

Good error messages depend on the parser tracking token expectations — not generic "syntax error" dumps. This integrates with VS Code via crabpascal.crabpascal.

Inspecting the AST

Development mode in crabpascal.toml:

[development]
save_ast = true
Enter fullscreen mode Exit fullscreen mode

Or use the CLI:

crab-pascal parse examples/crud/crud.dpr
Enter fullscreen mode Exit fullscreen mode

Output shows nested JSON/Rust-debug structures — invaluable when filing parser bugs during sprint reviews.

AST to semantic analysis

The parser does not check types. These are valid parse trees but invalid programs:

var x: Integer;
begin
  x := 'not an integer';  // Semantic error — parser accepts it
end.
Enter fullscreen mode Exit fullscreen mode

Phase 3 (semantic analyzer) walks the AST, resolves symbols, and rejects type mismatches. Clean separation keeps the parser focused on grammar.

Contributing to the parser

Parser changes are the most common sprint deliverables — new syntax (generics, attributes, exception blocks), better error messages, unit header edge cases. Tests add .pas fixtures and assert expected AST snapshots.

If you are learning compilers, the CrabPascal parser is readable Rust — no generated yacc tables, explicit recursive functions you can step through in a debugger.

Tokens in, tree out. Everything after depends on that tree being correct. 🦀


Português {#portugus}

Parser do CrabPascal: de tokens à AST

O lexer entrega ao parser um fluxo de tokens. O trabalho do parser é verificar se esse fluxo segue a gramática Pascal e construir uma Árvore de Sintaxe Abstrata (AST) — estrutura em árvore que representa a estrutura do programa sem pontuação irrelevante.

O CrabPascal v2.22.0 define dezenas de tipos de nós AST em Rust, serializados com serde para debug e testes de regressão por sprint.

Gramática vs AST

Considere este trecho:

procedure Greet(name: string);
begin
  WriteLn('Hello, ', name);
end;
Enter fullscreen mode Exit fullscreen mode

O parser não guarda parênteses e pontos-e-vírgula como nós. Produz:

ProcedureDecl {
  name: "Greet",
  params: [Param { name: "name", type: String }],
  body: Block {
    statements: [
      CallStmt { target: "WriteLn", args: [StringLit("Hello, "), Ident("name")] }
    ]
  }
}
Enter fullscreen mode Exit fullscreen mode

Ponto-e-vírgula importa no parsing, mas some na AST — exatamente o que as fases seguintes precisam.

Estrutura top-level de programas

Um .dpr parseia em nó Program:

program CrudAPI;
uses Horse, System.JSON, ProdutoService;
begin
  THorse.Listen(9000);
end.
Enter fullscreen mode Exit fullscreen mode

Forma da AST:

Program {
  name: "CrudAPI",
  uses: ["Horse", "System.JSON", "ProdutoService"],
  declarations: [],
  block: Block { statements: [...] }
}
Enter fullscreen mode Exit fullscreen mode

Units (.pas) viram nós Unit com seções interface e implementation — permitindo carregamento recursivo de rtl/ e search paths do projeto.

Parsing de expressões

Expressões usam precedence climbing ou descida recursiva — técnicas clássicas:

result := (a + b) * c > 10;
Enter fullscreen mode Exit fullscreen mode

Vira árvore respeitando precedência (* antes de +, comparação depois da aritmética):

Binary(GreaterThan,
  left: Binary(Multiply,
    left: Binary(Add, Ident("a"), Ident("b")),
    right: Ident("c")),
  right: Integer(10))
Enter fullscreen mode Exit fullscreen mode

Expressões malformadas falham aqui com erro de sintaxe, antes da análise semântica.

Construtos OOP

Classes acrescentam complexidade:

type
  TAnimal = class
  private
    FName: string;
  public
    constructor Create(AName: string);
    procedure Speak; virtual; abstract;
  end;
Enter fullscreen mode Exit fullscreen mode

A AST inclui ClassDecl com campos, métodos, seções de visibilidade e modificadores (virtual, abstract, override). Releases de sprint endurecem casos de borda — métodos parametrizados, tipos aninhados, declarações genéricas.

Recuperação de erro e diagnósticos

Quando o parsing falha, o CrabPascal reporta o primeiro erro coerente com linha/coluna:

ProdutoService.pas:28:5: error: Expected 'end' but found 'procedure'
Enter fullscreen mode Exit fullscreen mode

Boas mensagens dependem do parser rastrear expectativas de token — não dumps genéricos de "syntax error". Integra com VS Code via crabpascal.crabpascal.

Inspecionando a AST

Modo desenvolvimento no crabpascal.toml:

[development]
save_ast = true
Enter fullscreen mode Exit fullscreen mode

Ou pela CLI:

crab-pascal parse examples/crud/crud.dpr
Enter fullscreen mode Exit fullscreen mode

A saída mostra estruturas JSON/debug aninhadas — invaluable ao abrir bugs de parser nas reviews de sprint.

AST para análise semântica

O parser não checa tipos. Estas árvores parseiam, mas são programas inválidos:

var x: Integer;
begin
  x := 'not an integer';  // Erro semântico — parser aceita
end.
Enter fullscreen mode Exit fullscreen mode

A fase 3 (analisador semântico) percorre a AST, resolve símbolos e rejeita incompatibilidades. Separação limpa mantém o parser focado em gramática.

Contribuindo com o parser

Mudanças no parser são entregáveis frequentes de sprint — nova sintaxe (generics, attributes, blocos except), mensagens melhores, casos de header de unit. Testes adicionam fixtures .pas e assertam snapshots de AST.

Se você estuda compiladores, o parser CrabPascal é Rust legível — sem tabelas yacc geradas, funções recursivas explícitas para depurar passo a passo.

Tokens entram, árvore sai. Tudo depois depende dessa árvore estar correta. 🦀


Published on dev.to/@crabpascal · Código em CrabPascal

Top comments (0)