I had a problem. I was building a codebase indexer that needed to extract imports, exports, and type definitions from source files across multiple languages. The obvious approach was regex. I wrote regex parsers for TypeScript, Python, Rust, and Java.
They worked. Mostly.
Then someone opened a TypeScript file with dynamic imports, re-exports, and generic type parameters. The regex missed all of it. A Python file with decorated class methods and type annotations? Same story. The regex saw module-level def statements and nothing else.
I needed real parsers. Tree-sitter was the obvious choice. But the Go bindings all required CGO, which meant I could not cross-compile my tool for 6 platforms using GoReleaser without setting up C toolchains for each target.
The CGO problem
Tree-sitter is written in C. The two main Go binding libraries (tree-sitter/go-tree-sitter and smacker/go-tree-sitter) both use CGO to call into the C runtime.
CGO breaks cross-compilation. My tool ships binaries for darwin-arm64, darwin-amd64, linux-amd64, linux-arm64, windows-amd64, and windows-arm64. With CGO, I would need a C cross-compiler for each target, a matrix of Docker images, and a release pipeline that takes 20 minutes instead of 90 seconds.
I set CGO_ENABLED=0 in my GoReleaser config on day one and did not want to change that.
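For reference, the relevant part of a GoReleaser config looks roughly like this. This is a sketch using GoReleaser's `builds`/`env`/`goos`/`goarch` fields, not the actual stacklit config:

```yaml
# .goreleaser.yaml (sketch)
builds:
  - env:
      - CGO_ENABLED=0          # pure Go: no C toolchain per target
    goos: [darwin, linux, windows]
    goarch: [amd64, arm64]     # 3 x 2 = the six platforms above
```

With CGO off, GoReleaser can build all six targets on a single runner.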
A pure Go tree-sitter runtime
Then I found gotreesitter by @odvcencio. It is a ground-up reimplementation of the tree-sitter runtime in pure Go. No C code. No CGO. No shared libraries.
The library ships 205 embedded grammars as compressed blobs that lazy-load on first use. You call grammars.DetectLanguage("file.ts") and it returns the TypeScript grammar. Parse the file. Walk the AST. Done.
```go
entry := grammars.DetectLanguage("example.ts")
lang := entry.Language()
parser := gts.NewParser(lang)
tree, _ := parser.Parse(src)
root := tree.RootNode()
```
The parser, lexer, query engine, incremental reparsing, and external scanners are all implemented in Go. Cross-compiles to anything Go targets, including WASM.
The extraction architecture
I built one TreeSitterParser that dispatches to per-language extraction functions. Each language gets its own file with one function:
```text
internal/parser/
  treesitter.go     -- dispatch: detect language, parse, route to extractor
  ts_typescript.go  -- extractTypeScript(root, lang, src, path)
  ts_python.go      -- extractPython(root, lang, src, path)
  ts_rust.go        -- extractRust(root, lang, src, path)
  ts_java.go        -- extractJava(root, lang, src, path)
  ts_csharp.go      -- extractCSharp(...)
  ts_ruby.go        -- extractRuby(...)
  ts_php.go         -- extractPHP(...)
  ts_kotlin.go      -- extractKotlin(...)
  ts_swift.go       -- extractSwift(...)
  ts_c.go           -- extractC(...)  (handles C, C++, Objective-C)
```
Each extractor walks the AST using gts.Walk and collects four things: imports, exports (with signatures), type definitions (struct/class fields), and entrypoint detection.
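The dispatch pattern can be sketched as below. Everything here (`FileIndex`, `dispatch`, the stub extractors, routing by file extension) is an illustrative stand-in, not the real `treesitter.go`, which detects the language via gotreesitter and hands each extractor the parsed AST:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// FileIndex is a hypothetical stand-in for whatever each extractor
// returns; the real extractors also collect type defs and entrypoints.
type FileIndex struct {
	Imports []string
	Exports []string
}

// extractor is the shared signature every per-language function satisfies.
type extractor func(src []byte, path string) FileIndex

// Stubs standing in for extractTypeScript, extractPython, etc.
// The real functions walk the tree-sitter AST with gts.Walk.
func extractTypeScriptStub(src []byte, path string) FileIndex { return FileIndex{} }
func extractPythonStub(src []byte, path string) FileIndex     { return FileIndex{} }

var extractors = map[string]extractor{
	".ts": extractTypeScriptStub,
	".py": extractPythonStub,
}

// dispatch routes a file to its per-language extractor. Keyed on the
// file extension here for simplicity; the real code uses
// grammars.DetectLanguage on the filename.
func dispatch(src []byte, path string) (FileIndex, error) {
	ext := strings.ToLower(filepath.Ext(path))
	ex, ok := extractors[ext]
	if !ok {
		return FileIndex{}, fmt.Errorf("no extractor for %s", ext)
	}
	return ex(src, path), nil
}

func main() {
	_, err := dispatch([]byte("export const x = 1"), "app.ts")
	fmt.Println("dispatched, err =", err)
}
```

A map of function values keeps `treesitter.go` free of language-specific logic; adding a language is one file plus one map entry.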
What tree-sitter nodes actually look like
This is the part that took trial and error. Each language grammar produces different node types, and the documentation is... sparse.
For example, export class MyClass {} in TypeScript produces:
```text
program
  export_statement
    export
    class_declaration
      class
      type_identifier "MyClass"
      class_body
```
The class name is a type_identifier, not an identifier. I initially wrote childByType(node, lang, "identifier") and got nothing back for classes. Functions use identifier. Classes use type_identifier. Interfaces use type_identifier. Enums use identifier again. There is no consistency across node types.
Every language has these quirks. Python function definitions store parameters in a parameters field but return types in a return_type field. Rust visibility modifiers are a separate visibility_modifier child node, not an attribute on the declaration. Java distinguishes class_declaration from record_declaration from enum_declaration, each with slightly different child structures.
I ended up writing helper functions that try multiple node type names:
```go
func tsGetIdentifier(node *gts.Node, lang *gts.Language, src []byte) string {
	n := childByType(node, lang, "identifier")
	if n == nil {
		n = childByType(node, lang, "type_identifier")
	}
	if n == nil {
		return ""
	}
	return nodeText(n, src)
}
```
The Merkle hash trick for incremental updates
The indexer generates output files that you commit to git. But you do not want to regenerate on every commit if only a README changed.
I compute a Merkle hash of all source files. The hash only changes when actual code changes. Non-source files (docs, configs, lock files) are excluded from the hash.
On commit, the git hook runs stacklit generate. It computes the current hash, compares it to the stored hash, and skips regeneration if they match. Fast path: one hash comparison, no file parsing.
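The hashing step can be sketched as a small Go program. This is a hypothetical stand-in for `internal/git/merkle.go`, not stacklit's actual code: `merkleRoot` is an invented name, and files are passed as an in-memory map rather than read from disk:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// merkleRoot computes a deterministic hash over a set of source files
// (path -> contents). Any change to any file's contents, or any file
// added or removed, changes the root; map iteration order does not.
func merkleRoot(files map[string][]byte) string {
	paths := make([]string, 0, len(files))
	for p := range files {
		paths = append(paths, p)
	}
	sort.Strings(paths) // fixed order makes the root deterministic

	h := sha256.New()
	for _, p := range paths {
		leaf := sha256.Sum256(files[p]) // per-file leaf hash
		fmt.Fprintf(h, "%s\x00%x\n", p, leaf)
	}
	return fmt.Sprintf("%x", h.Sum(nil))
}

func main() {
	before := merkleRoot(map[string][]byte{"main.go": []byte("package main")})
	after := merkleRoot(map[string][]byte{"main.go": []byte("package main // edited")})
	fmt.Println("changed:", before != after)
}
```

In the real tool the hook would compare the computed root to the stored one and exit early on a match; non-source files simply never enter the map.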
Binary size trade-off
The old regex parsers added negligible size. Switching to gotreesitter with 205 embedded grammars increased the binary from 12MB to 32MB.
I only use 11 of those 205 grammars. The library lazy-loads grammars on first use, so unused grammars do not affect runtime memory. But their compressed blobs are compiled into the binary regardless.
20MB of grammar data for zero-CGO cross-compilation felt like an acceptable trade. The alternative was maintaining regex parsers that missed half the syntax, or setting up CGO cross-compilation infrastructure.
Performance
Tree-sitter is fast. Parsing a single file takes under a millisecond in most cases. The bottleneck is I/O (reading files from disk), not parsing.
Stacklit indexing itself (66 Go files): 83ms
Express.js (141 JS files): 100ms
FastAPI (1,131 Python files): 400ms
Axum (300 Rust files): 300ms
The old regex parsers were faster (30ms for the Go repo) because regex is cheaper than building a full AST. But 83ms versus 30ms is not a difference anyone notices.
What regex missed that tree-sitter catches
Some concrete examples from real files:
TypeScript dynamic imports:
```typescript
const module = await import('./heavy-module');
```
Regex: missed entirely. Tree-sitter: call_expression with import identifier.
Python class methods:
```python
class AuthService:
    def login(self, email: str, password: str) -> bool:
        ...

    def logout(self) -> None:
        ...
```
Regex: saw class AuthService only. Tree-sitter: sees both methods with full signatures and type annotations.
Rust generics and trait bounds:
```rust
pub fn sort<T: Ord + Clone>(items: &mut Vec<T>) -> &[T]
```
Regex: captured sort only. Tree-sitter: captures the full signature including generic parameters and bounds.
Java method signatures:
```java
public ResponseEntity<List<User>> getUsers(@RequestParam int page)
```
Regex: captured class UserController only. Tree-sitter: captures the method name, return type, and parameter types.
The code
Everything described here is MIT licensed and on GitHub:
https://github.com/glincker/stacklit
The tree-sitter integration is in internal/parser/treesitter.go and the per-language extractors are in internal/parser/ts_*.go. The Merkle hashing is in internal/git/merkle.go.
The tool itself generates a codebase index (JSON + Mermaid diagram + interactive HTML visualization) that is useful for understanding project architecture and for giving context to coding tools.