Aakash Gupta

Posted on May 20

How I Parse 14 Languages With One Function — Codewalk Deep Dives #1

#ai #programming #opensource #python

Building a polyglot code parser with tree-sitter that handles Python, Rust, Dart, Go, and 10 more — without writing 14 parsers.

I needed to extract every function and class from any codebase a user throws at me — Python, Dart, Rust, Java, Go, C++, all of them. I started with Python’s built-in “ast” module. It worked — for Python. Then I needed JavaScript. And Rust. And 11 more. I wasn’t about to write 14 parsers.

I ended up with a single “parse_file()” function that handles all 14 languages. Here’s exactly how — using tree-sitter, a registry pattern, and three edge case fallbacks.

WHAT IS CODEWALK?

Codewalk is an open-source AI code analysis tool I’m building. You point it at any codebase and it gives you module detection, blast radius analysis, dependency graphs, reading order, and AI-powered code review, all from one “pip install”. This is the first post in a series where I break down the algorithms and engineering decisions inside Codewalk.

The series:

1 — How I Parse 14 Languages With One Function ← you are here
2 — Building a Universal Import Resolver for 14 Languages
3 — Auto-Detecting Project Modules Without Configuration
4 — BFS Blast Radius + Topological Sort for Code Reading Order
5 — AST-Aware Chunking: Why Function-Level Chunks Beat Blind Splitting
6 — Building an Incremental RAG Pipeline with ChromaDB
7 — AI Code Review Beyond the Diff: 5 Context Sources, 1 Prompt
8 — Designing an 18-Tool MCP Server for Code Onboarding
9 — Voice-Controlled Code Analysis with Whisper + Edge-TTS

[code_parser.py on GitHub]

WHY CODEWALK NEEDS TO PARSE CODE

Codewalk does more than search — it understands structure. To build a dependency graph, detect modules, calculate blast radius, or generate a reading order, you need to know what’s inside each file: which functions exist, which classes exist, where they start and end.

This matters for three things downstream:

AST-aware chunking — each function becomes its own chunk for embedding into ChromaDB. When you search “retry logic,” you get the exact function, not a random 500-character slice.
Import resolution — knowing which files import which builds the dependency graph. Tree-sitter extracts import statements from any language.
Code review context — when reviewing a diff, Codewalk pulls the full function that changed, not just the diff lines.
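To make the chunking idea concrete, here is a minimal sketch of function-level chunking. It is Python-only and uses the standard library’s “ast” module as a stand-in; it is not Codewalk’s tree-sitter pipeline, and “chunk_by_function” and “SAMPLE” are hypothetical names made up for this illustration.

```python
import ast

def chunk_by_function(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+)
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "name": node.name,
                "start": node.lineno,
                "end": node.end_lineno,
                "text": text,
            })
    return chunks

# Hypothetical sample file used only for this demo
SAMPLE = '''import time

def retry(fn, attempts=3):
    for _ in range(attempts):
        try:
            return fn()
        except Exception:
            time.sleep(1)

class Client:
    def get(self):
        return 42
'''

chunks = chunk_by_function(SAMPLE)
for chunk in chunks:
    print(chunk["name"], chunk["start"], chunk["end"])
```

Each chunk carries its own name and line range, so a search hit can point at a whole function instead of an arbitrary slice of the file.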

All of this requires parsing. And since Codewalk supports any codebase (Python, Flutter, Rust monorepos, Go services), the parser can’t be Python-only.

THE STARTING POINT

Python Was Easy. Then I Needed 13 More.

Python has a built-in “ast” module. You call “ast.parse(source)”, walk the tree, and check for “ast.FunctionDef” and “ast.ClassDef”.

Done in 40 lines:

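import ast

def parse_python_file(file_path: str) -> list[dict]:
    # ast.parse() takes source text, not a path, so read the file first
    with open(file_path, encoding="utf-8") as f:
        source = f.read()
    tree = ast.parse(source)
    items = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            items.append({"name": node.name, "type": "function"})  # ... other fields elided
    return items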

But “ast” only works for Python. JavaScript? Rust? Dart? Each language has its own AST format, its own parser, its own library. I wasn’t going to find an “ast” module for 13 more languages.

That’s where [tree-sitter] comes in. Tree-sitter is a parser generator that builds syntax trees from source code, just like Python’s “ast” module, but for any language. Each language has its own grammar, shipped as a separate pip package (“tree-sitter-python”, “tree-sitter-rust”, “tree-sitter-dart”, etc.).

The killer feature: one API across every language. You call “parser.parse(source)”, get a tree, walk the nodes. The API is identical whether you’re parsing Python or Rust. What differs is the node types inside the tree — and that’s where the problem starts.

THE PROBLEM

Every Language Calls a Function Something Different

Tree-sitter gave me one API — but every grammar made different choices about what to call things.

A “function” in tree-sitter’s Python grammar is “function_definition”. In JavaScript, it’s “function_declaration” or “method_definition”. In Rust, it’s “function_item”. In C, the function name isn’t even on the function node; it’s buried inside a “function_declarator” child node.

The obvious next step was writing a handler per language:

# ❌ The obvious approach: one handler per language
def parse_python(source):
    for node in walk(tree):
        if node.type == "function_definition":
            name = node.child_by_field_name("name")
            # ...

def parse_javascript(source):
    for node in walk(tree):
        if node.type in ("function_declaration", "method_definition"):
            name = node.child_by_field_name("name")
            # ... same logic, different strings

def parse_rust(source):
    for node in walk(tree):
        if node.type == "function_item":
            name = node.child_by_field_name("name")
            # ... same loop again

# ... 11 more of these

14 functions. Same structure. Different strings. This is a textbook case for a data-driven approach.

THE SOLUTION: A NODE TYPE REGISTRY

One Registry, One Walk Function

Instead of 14 parser functions, I built a registry — a dict that maps each language to its AST node types:

NODE_TYPES = {
    "python": {
        "function": ["function_definition"],
        "class": ["class_definition"],
        "name_field": "name",
        "params_field": "parameters",
    },
    "javascript": {
        "function": ["function_declaration", "method_definition"],
        "class": ["class_declaration"],
        "name_field": "name",
        "params_field": "formal_parameters",
    },
    "rust": {
        "function": ["function_item"],
        "class": ["struct_item", "impl_item", "enum_item"],
        "name_field": "name",
        "params_field": "parameters",
    },
    # ... 11 more languages, same shape
}

Every language entry has the same four keys. The values differ, but the structure is identical. That’s what makes a single parser function possible.
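Because the single parser depends on every entry having the same shape, it can help to check that shape mechanically. The sketch below is one way to do it, not part of Codewalk as far as this post shows; “REQUIRED_KEYS” and “validate_registry” are hypothetical names, and the registry here is trimmed to two of the entries above.

```python
# The four keys the article's registry entries share
REQUIRED_KEYS = {"function", "class", "name_field", "params_field"}

NODE_TYPES = {
    "python": {
        "function": ["function_definition"],
        "class": ["class_definition"],
        "name_field": "name",
        "params_field": "parameters",
    },
    "rust": {
        "function": ["function_item"],
        "class": ["struct_item", "impl_item", "enum_item"],
        "name_field": "name",
        "params_field": "parameters",
    },
}

def validate_registry(registry: dict) -> list[str]:
    """Return a list of problems; an empty list means every entry has the expected shape."""
    problems = []
    for language, entry in registry.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            problems.append(f"{language}: missing {sorted(missing)}")
        # "function" and "class" must be lists of node type strings
        for key in ("function", "class"):
            if key in entry and not isinstance(entry[key], list):
                problems.append(f"{language}: {key} should be a list")
    return problems

problems = validate_registry(NODE_TYPES)
print(problems or "registry OK")
```

Running a check like this in a unit test would catch a typo in a new language entry before it turns into a “KeyError” in the middle of a parse.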

Now the walk function doesn’t care what language it’s parsing:

# ✅ After: one function, all 14 languages
def parse_file(file_path: str, language: str) -> list[dict]:
    parser = get_parser_for_language(language)
    node_types = NODE_TYPES[language]
    # tree-sitter parses bytes, so read the file in binary mode
    with open(file_path, "rb") as f:
        source = f.read()
    tree = parser.parse(source)
    function_types = set(node_types["function"])
    class_types = set(node_types["class"])
    all_targets = function_types | class_types
    items = []
    for node in walk_tree(tree.root_node, all_targets):
        name = extract_name(node, node_types["name_field"])
        item_type = "function" if node.type in function_types else "class"
        items.append({"type": item_type, "name": name})  # ... other fields elided
    return items

Adding a new language means adding one dict entry — not a new function. Kotlin took 4 lines. Swift took 4 lines.
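To see the registry-driven loop run without installing any grammars, here is a small sketch that swaps real tree-sitter nodes for a hypothetical “FakeNode” class. “FakeNode”, “extract_items”, and “named” are all invented for this demo; only the node type strings come from the registry above.

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node, just enough for this demo."""
    type: str
    text: bytes = b""
    fields: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child_by_field_name(self, name):
        return self.fields.get(name)

NODE_TYPES = {
    "python": {"function": ["function_definition"], "class": ["class_definition"], "name_field": "name"},
    "rust": {"function": ["function_item"], "class": ["struct_item", "impl_item", "enum_item"], "name_field": "name"},
}

def walk(node):
    """Depth-first traversal over every node in the tree."""
    yield node
    for child in node.children:
        yield from walk(child)

def extract_items(root, language):
    """Same loop for every language; only the registry lookup differs."""
    spec = NODE_TYPES[language]
    function_types = set(spec["function"])
    targets = function_types | set(spec["class"])
    items = []
    for node in walk(root):
        if node.type in targets:
            name_node = node.child_by_field_name(spec["name_field"])
            items.append({
                "type": "function" if node.type in function_types else "class",
                "name": name_node.text.decode("utf-8") if name_node else "<anonymous>",
            })
    return items

def named(node_type, name):
    """Build a fake definition node whose name field holds an identifier."""
    return FakeNode(node_type, fields={"name": FakeNode("identifier", text=name.encode())})

python_tree = FakeNode("module", children=[named("function_definition", "load"), named("class_definition", "Cache")])
rust_tree = FakeNode("source_file", children=[named("function_item", "load"), named("struct_item", "Cache")])

python_items = extract_items(python_tree, "python")
rust_items = extract_items(rust_tree, "rust")
print(python_items == rust_items)
```

The two trees use different node type names, yet the same function produces the same list of items for both.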

THE EDGE CASES (where it gets interesting)

Three Languages That Refused to Play Nice

The registry handles most of the 14 languages cleanly. Three edge cases needed special treatment, and each one taught me something about how fundamentally different language grammars are.

Edge Case 1: C/C++ — The Name Is Buried
In Python, the function name sits directly on the function node:

function_definition
  ├── name: "process_data"      ← right here
  └── parameters: (...)

In C, tree-sitter nests it one level deeper inside a “function_declarator”:

function_definition
  ├── type: "int"
  └── function_declarator          ← name is inside HERE
       ├── declarator: "main"
       └── parameters: (...)

My “extract_name()” function tries the standard approach first, then falls back to digging into the declarator:

def extract_name(node, name_field: str) -> str:
    # Standard: name is directly on the node
    name_node = node.child_by_field_name(name_field)
    if name_node:
        return name_node.text.decode("utf-8")
    # Fallback for C/C++: dig into function_declarator
    for child in node.children:
        if child.type == "function_declarator":
            inner = child.child_by_field_name("declarator")
            if inner:
                return inner.text.decode("utf-8")
    return "<anonymous>"

Not a separate parser — just a fallback path in the same function.
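Here is a quick way to exercise that fallback without a C grammar installed. The sketch reuses “extract_name()” as shown above and feeds it hand-built trees made from a hypothetical “FakeNode” class that only mimics the handful of node attributes the function touches.

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node (type, text, fields, children)."""
    type: str
    text: bytes = b""
    fields: dict = field(default_factory=dict)
    children: list = field(default_factory=list)

    def child_by_field_name(self, name):
        return self.fields.get(name)

def extract_name(node, name_field: str) -> str:
    # Standard: name is directly on the node
    name_node = node.child_by_field_name(name_field)
    if name_node:
        return name_node.text.decode("utf-8")
    # Fallback for C/C++: dig into function_declarator
    for child in node.children:
        if child.type == "function_declarator":
            inner = child.child_by_field_name("declarator")
            if inner:
                return inner.text.decode("utf-8")
    return "<anonymous>"

# Python-shaped node: the name field sits on the function node itself
python_fn = FakeNode("function_definition", fields={"name": FakeNode("identifier", b"process_data")})

# C-shaped node: no name field; the identifier lives inside function_declarator
c_declarator = FakeNode("function_declarator", fields={"declarator": FakeNode("identifier", b"main")})
c_fn = FakeNode("function_definition", children=[FakeNode("primitive_type", b"int"), c_declarator])

# Node with neither shape: falls through to the sentinel
nameless = FakeNode("function_definition")

print(extract_name(python_fn, "name"), extract_name(c_fn, "name"), extract_name(nameless, "name"))
```

The Python-shaped node never reaches the fallback, so adding the C path costs nothing for languages that follow the standard layout.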

Edge Case 2: Dart — Double Nesting
Dart’s “method_signature” wraps a “function_signature”, which wraps the actual name. Worse: if you’re not careful, the tree walk yields both the outer and inner node, so you get every method twice.

method_signature
  └── function_signature        ← ALSO matches "function" node types
       └── name: "fetchUser"

Two problems, two fixes.

For the name, I added another fallback to “extract_name()”:

    # Fallback for Dart: name is inside function_signature
    for child in node.children:
        if child.type in ("function_signature", "getter_signature"):
            inner_name = child.child_by_field_name(name_field)
            if inner_name:
                return inner_name.text.decode("utf-8")

For the duplicate problem, I added “skip_children_types” to the tree walker:

def walk_tree(node, target_types, skip_children_types=None):
    if node.type in target_types:
        yield node
        # Don't recurse into children that would match again
        # (guard against the default None, which would raise TypeError)
        if skip_children_types and node.type in skip_children_types:
            return
    for child in node.children:
        yield from walk_tree(child, target_types, skip_children_types)

When parsing Dart, “function_types” is passed as “skip_children_types”. If we match a “method_signature”, we yield it — but don’t recurse into its “function_signature” child. No duplicates.
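The effect is easy to check on a hand-built tree. This sketch assumes a Dart-like shape in which both “method_signature” and “function_signature” are registered as function types; “FakeNode” is a hypothetical stand-in for a tree-sitter node, and the walker includes a guard for the default “None”.

```python
from dataclasses import dataclass, field

@dataclass
class FakeNode:
    """Hypothetical stand-in for a tree-sitter node: a type plus children."""
    type: str
    children: list = field(default_factory=list)

def walk_tree(node, target_types, skip_children_types=None):
    if node.type in target_types:
        yield node
        # Stop here so a matching child is not yielded a second time
        if skip_children_types and node.type in skip_children_types:
            return
    for child in node.children:
        yield from walk_tree(child, target_types, skip_children_types)

FUNCTION_TYPES = {"method_signature", "function_signature"}

dart_tree = FakeNode("class_body", [
    # A method: the outer signature wraps an inner function_signature
    FakeNode("method_signature", [FakeNode("function_signature")]),
    # A standalone function_signature with no wrapper
    FakeNode("function_signature"),
])

without_skip = [n.type for n in walk_tree(dart_tree, FUNCTION_TYPES)]
with_skip = [n.type for n in walk_tree(dart_tree, FUNCTION_TYPES, FUNCTION_TYPES)]

print(without_skip)
print(with_skip)
```

One trade-off of this walker: once a node matches and is in the skip set, nothing beneath it is visited, so a function nested inside another matched function would not be reported either.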

Edge Case 3: TypeScript and PHP — Grammar Naming
Most tree-sitter grammar packages expose a “language()” function. TypeScript and PHP don’t.

TypeScript’s package has two grammars (TypeScript + TSX), so it exposes “language_typescript()” and “language_tsx()”. PHP exposes “language_php()”.

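def get_language(language: str):
    # Return the cached grammar if this language was already loaded
    if language in _language_cache:
        return _language_cache[language]
    grammar_module = importlib.import_module(GRAMMAR_MAP[language])
    # TypeScript and PHP expose differently named entry points
    if language == "typescript":
        lang = Language(grammar_module.language_typescript())
    elif language == "php":
        lang = Language(grammar_module.language_php())
    else:
        lang = Language(grammar_module.language())
    _language_cache[language] = lang
    return lang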

Without the cache, you’d reload and re-initialize the grammar on every file. With 500 TypeScript files, that matters.
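If more grammars with unusual entry points show up, the if/elif chain could itself become data. This is a possible variation, not what the post’s code does: “LANGUAGE_FACTORY” and “resolve_language_factory” are hypothetical names, and the grammar modules are faked with “SimpleNamespace” so the sketch runs without any grammar installed.

```python
from types import SimpleNamespace

# Languages whose grammar package does not expose a plain language() function
LANGUAGE_FACTORY = {
    "typescript": "language_typescript",
    "php": "language_php",
}

def resolve_language_factory(grammar_module, language: str):
    """Return the callable that produces the grammar for this language."""
    factory_name = LANGUAGE_FACTORY.get(language, "language")
    # getattr raises AttributeError if the module lacks the expected function
    return getattr(grammar_module, factory_name)

# Hypothetical stand-ins for imported grammar packages
fake_python = SimpleNamespace(language=lambda: "python-grammar")
fake_typescript = SimpleNamespace(
    language_typescript=lambda: "ts-grammar",
    language_tsx=lambda: "tsx-grammar",
)

print(resolve_language_factory(fake_python, "python")())
print(resolve_language_factory(fake_typescript, "typescript")())
```

A missing entry point surfaces as “AttributeError”, which is the same exception a caller would already handle for a grammar with an unexpected layout.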

LAZY GRAMMAR LOADING

Lazy Loading 14 Grammars Without Importing 14 Packages

A Rust codebase doesn’t need the PHP grammar loaded. A Python project doesn’t need the Swift grammar. Importing all 14 at startup would be wasteful — and would crash if a grammar package isn’t installed.

Each grammar is a separate pip package (“tree-sitter-python”, “tree-sitter-rust”, etc.). I use “importlib” to load them on demand:

import importlib

from tree_sitter import Language

GRAMMAR_MAP = {
    "python":     "tree_sitter_python",
    "javascript": "tree_sitter_javascript",
    "typescript": "tree_sitter_typescript",
    "dart":       "tree_sitter_dart",
    # ... 10 more
}
_language_cache = {}

def get_language(language: str):
    if language in _language_cache:
        return _language_cache[language]
    module_name = GRAMMAR_MAP.get(language)
    if not module_name:
        return None
    try:
        grammar_module = importlib.import_module(module_name)
        # The TypeScript/PHP naming branch from Edge Case 3 is omitted here
        lang = Language(grammar_module.language())
        _language_cache[language] = lang
        return lang
    except (ImportError, AttributeError):
        return None  # grammar not installed — skip silently

If “tree-sitter-dart” isn’t installed, Dart files are silently skipped. No crash. No config needed. Install the grammar, restart, done.
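Silent skipping is convenient, but it can also hide why a language produced no results. One option is a small report of which grammar packages are importable. The sketch below uses “importlib.util.find_spec” from the standard library; “installed_grammars” is a hypothetical helper, and the demo map mixes a module that always exists (“json”) with a name assumed not to exist, so the outcome does not depend on which grammars are installed.

```python
import importlib.util

def installed_grammars(grammar_map: dict[str, str]) -> dict[str, bool]:
    """Map each language to True if its grammar module can be found, without importing it."""
    return {
        language: importlib.util.find_spec(module_name) is not None
        for language, module_name in grammar_map.items()
    }

# Demo map: one stdlib module that is always present, one assumed to be absent
DEMO_MAP = {
    "present": "json",
    "absent": "codewalk_demo_missing_grammar",
}

report = installed_grammars(DEMO_MAP)
for language, available in report.items():
    print(f"{language}: {'installed' if available else 'not installed'}")
```

Passing the real “GRAMMAR_MAP” instead of the demo map would show which languages are parseable on the current machine.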

THE RESULTS

What This Gets You

The entire parser is ~290 lines — supporting 14 languages.

Languages supported: 14
Core functions: 4 (“get_language”, “extract_name”, “walk_tree”, “parse_file”)
Lines of code: ~290
Edge case handlers: 3 (C name nesting, Dart dedup, TS/PHP grammar naming)

Adding Kotlin was literally this:

"kotlin": {
    "function": ["function_declaration"],
    "class": ["class_declaration", "object_declaration"],
    "name_field": "name",
    "params_field": "function_value_parameters",
},

Four lines. No new function. No new file.

TAKEAWAYS

When you’re writing the same function with different constants, extract the constants into data. The NODE_TYPES registry turned 14 functions into 1.
Tree-sitter grammars are not uniform. C nests names deeper, Dart double-wraps signatures, TypeScript splits its grammar export. Budget time for edge cases.
Lazy loading with “importlib” lets you support many languages without requiring all of them. Users install only the grammars they need.
“skip_children_types” is a pattern worth stealing. Anytime you’re walking a tree and parent/child nodes overlap in type, you need a way to prevent double-yielding.
The fallback chain pattern (“try standard → try C-style → try Dart-style → return anonymous”) is more maintainable than per-language if/else blocks. Each fallback is independent.

CLOSING

I started with 14 languages and the assumption that each one needed its own parser. The reality? They all have functions and classes; they just call them different things. A dict mapping and three fallback paths handle all 14 languages Codewalk supports.

Next up: these parsed functions mean nothing without knowing which files import which. In Post #2, I build an import resolver that handles Python’s dot-separated paths, Rust’s “crate::” syntax, Dart’s “package:” URIs, and 11 more — each with completely different rules.

That one was harder.

📦 Full source: [code_parser.py on GitHub]

👉 Follow for the next deep dive, or check the series links at the top.