Building a polyglot code parser with tree-sitter that handles Python, Rust, Dart, Go, and 10 more — without writing 14 parsers.
I needed to extract every function and class from any codebase a user throws at me — Python, Dart, Rust, Java, Go, C++, all of them. I started with Python’s built-in “ast” module. It worked — for Python. Then I needed JavaScript. And Rust. And 11 more. I wasn’t about to write 14 parsers.
I ended up with a single “parse_file()” function that handles all 14 languages. Here’s exactly how — using tree-sitter, a registry pattern, and three edge case fallbacks.
WHAT IS CODEWALK?
Codewalk is an open-source AI code analysis tool I’m building. You point it at any codebase and it gives you module detection, blast radius analysis, dependency graphs, reading order, and AI-powered code review, all from one “pip install”. This is the first post in a series where I break down the algorithms and engineering decisions inside Codewalk.
The series:
1 — How I Parse 14 Languages With One Function ← you are here
2 — Building a Universal Import Resolver for 14 Languages
3 — Auto-Detecting Project Modules Without Configuration
4 — BFS Blast Radius + Topological Sort for Code Reading Order
5 — AST-Aware Chunking: Why Function-Level Chunks Beat Blind Splitting
6 — Building an Incremental RAG Pipeline with ChromaDB
7 — AI Code Review Beyond the Diff: 5 Context Sources, 1 Prompt
8 — Designing an 18-Tool MCP Server for Code Onboarding
9 — Voice-Controlled Code Analysis with Whisper + Edge-TTS
WHY CODEWALK NEEDS TO PARSE CODE
Codewalk does more than search — it understands structure. To build a dependency graph, detect modules, calculate blast radius, or generate a reading order, you need to know what’s inside each file: which functions exist, which classes exist, where they start and end.
This matters for three things downstream:
- AST-aware chunking: each function becomes its own chunk for embedding into ChromaDB. When you search “retry logic,” you get the exact function, not a random 500-character slice.
- Import resolution: knowing which files import which builds the dependency graph. Tree-sitter extracts import statements from any language.
- Code review context: when reviewing a diff, Codewalk pulls the full function that changed, not just the diff lines.
All of this requires parsing. And since Codewalk supports any codebase (Python, Flutter, Rust monorepos, Go services), the parser can’t be Python-only.
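To make the chunking point concrete, here is a minimal sketch that uses only Python's standard “ast” module, so it covers Python source only. The function name `function_chunks` and the dict layout are assumptions for illustration, not Codewalk's actual code.

```python
import ast


def function_chunks(source: str) -> list[dict]:
    """Return one chunk per function or class, with its line span and exact text."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": "class" if isinstance(node, ast.ClassDef) else "function",
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # The exact source text of this definition, ready to embed as one chunk
                "text": ast.get_source_segment(source, node),
            })
    return chunks


SAMPLE = '''\
def retry(fn, attempts=3):
    for _ in range(attempts):
        try:
            return fn()
        except Exception:
            pass

class Client:
    def get(self, url):
        return url
'''

chunks = function_chunks(SAMPLE)
for chunk in chunks:
    print(chunk["kind"], chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk carries a name and a line span, which is the information a search result or a review prompt needs in order to point back at the right place in the file.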
THE STARTING POINT
Python Was Easy. Then I Needed 13 More.
Python has a built-in “ast” module. You call “ast.parse(source)”, walk the tree, and check for “ast.FunctionDef” and “ast.ClassDef”.
Done in 40 lines:
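import ast

def parse_python_file(file_path: str) -> list[dict]:
    # Read the file, then parse it into Python's own AST
    with open(file_path, encoding="utf-8") as f:
        source = f.read()
    tree = ast.parse(source)
    items = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # ... other fields are elided in this excerpt
            items.append({"name": node.name, "type": "function"})
    return items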
But “ast” only works for Python. JavaScript? Rust? Dart? Each language has its own AST format, its own parser, its own library. I wasn’t going to find an “ast” module for 13 more languages.
That’s where tree-sitter comes in. Tree-sitter is a parser generator that builds syntax trees from source code, just like Python’s “ast” module, but for any language. Each language has its own grammar, shipped as a separate pip package (“tree-sitter-python”, “tree-sitter-rust”, “tree-sitter-dart”, etc.).
The killer feature: one API across every language. You call “parser.parse(source)”, get a tree, walk the nodes. The API is identical whether you’re parsing Python or Rust. What differs is the node types inside the tree — and that’s where the problem starts.
THE PROBLEM
Every Language Calls a Function Something Different
Tree-sitter gave me one API — but every grammar made different choices about what to call things.
A “function” in tree-sitter’s Python grammar is “function_definition”. In JavaScript, it’s “function_declaration” or “method_definition”. In Rust, it’s “function_item”. In C, the function name isn’t even on the function node: it’s buried inside a “function_declarator” child node.
The obvious next step was writing a handler per language:
# ❌ The obvious approach: one handler per language
def parse_python(source):
    for node in walk(tree):
        if node.type == "function_definition":
            name = node.child_by_field_name("name")
            # ...

def parse_javascript(source):
    for node in walk(tree):
        if node.type in ("function_declaration", "method_definition"):
            name = node.child_by_field_name("name")
            # ... same logic, different strings

def parse_rust(source):
    for node in walk(tree):
        if node.type == "function_item":
            name = node.child_by_field_name("name")
            # ... same loop again

# ... 11 more of these
14 functions. Same structure. Different strings. This is a textbook case for a data-driven approach.
THE SOLUTION: A NODE TYPE REGISTRY
One Registry, One Walk Function
Instead of 14 parser functions, I built a registry — a dict that maps each language to its AST node types:
NODE_TYPES = {
    "python": {
        "function": ["function_definition"],
        "class": ["class_definition"],
        "name_field": "name",
        "params_field": "parameters",
    },
    "javascript": {
        "function": ["function_declaration", "method_definition"],
        "class": ["class_declaration"],
        "name_field": "name",
        "params_field": "formal_parameters",
    },
    "rust": {
        "function": ["function_item"],
        "class": ["struct_item", "impl_item", "enum_item"],
        "name_field": "name",
        "params_field": "parameters",
    },
    # ... 11 more languages, same shape
}
Every language entry has the same four keys. The values differ, but the structure is identical. That’s what makes a single parser function possible.
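Because the walker relies on every entry having the same shape, a small check can catch a malformed entry before any file is parsed. This is only a sketch: `REQUIRED_KEYS` and `validate_registry` are hypothetical names, and the two-entry registry is a trimmed copy of the one above.

```python
REQUIRED_KEYS = {"function", "class", "name_field", "params_field"}

NODE_TYPES = {
    "python": {
        "function": ["function_definition"],
        "class": ["class_definition"],
        "name_field": "name",
        "params_field": "parameters",
    },
    "rust": {
        "function": ["function_item"],
        "class": ["struct_item", "impl_item", "enum_item"],
        "name_field": "name",
        "params_field": "parameters",
    },
}


def validate_registry(registry: dict) -> list[str]:
    """Return human-readable problems; an empty list means every entry has the expected shape."""
    problems = []
    for language, entry in registry.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            problems.append(f"{language}: missing keys {sorted(missing)}")
        for key in ("function", "class"):
            # Node types must be lists so the walker can build sets from them
            if key in entry and not isinstance(entry[key], list):
                problems.append(f"{language}: '{key}' must be a list of node types")
    return problems


# A deliberately broken entry: a bare string instead of a list, and no params_field
broken = {"go": {"function": "function_declaration", "class": [], "name_field": "name"}}

print(validate_registry(NODE_TYPES))
print(validate_registry(broken))
```

Running a check like this in a unit test keeps the "one dict entry per language" promise honest as the registry grows.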
Now the walk function doesn’t care what language it’s parsing:
# ✅ After: one function, all 14 languages
def parse_file(file_path: str, language: str) -> list[dict]:
    parser = get_parser_for_language(language)
    node_types = NODE_TYPES[language]

    # tree-sitter parses bytes, so read the file in binary mode
    with open(file_path, "rb") as f:
        source = f.read()
    tree = parser.parse(source)

    function_types = set(node_types["function"])
    class_types = set(node_types["class"])
    all_targets = function_types | class_types

    items = []
    for node in walk_tree(tree.root_node, all_targets):
        name = extract_name(node, node_types["name_field"])
        item_type = "function" if node.type in function_types else "class"
        # ... other fields are elided in this excerpt
        items.append({"type": item_type, "name": name})
    return items
Adding a new language means adding one dict entry — not a new function. Kotlin took 4 lines. Swift took 4 lines.
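To watch the registry-driven loop run end to end without installing any grammar, here is a small sketch. `StubNode` is a hypothetical stand-in for a tree-sitter node (it is not the real node API), `collect_items` is an illustrative name, and the registry is trimmed to the keys this demo needs.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    """Hypothetical stand-in for a tree-sitter node: a type, named fields, and children."""
    type: str
    text: str = ""
    fields: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


NODE_TYPES = {
    "python": {"function": ["function_definition"], "class": ["class_definition"], "name_field": "name"},
    "rust": {"function": ["function_item"], "class": ["struct_item", "impl_item", "enum_item"], "name_field": "name"},
}


def walk(node):
    """Yield every node in the tree, depth-first."""
    yield node
    for child in node.children:
        yield from walk(child)


def collect_items(root: StubNode, language: str) -> list[dict]:
    """The same loop serves every language; only the registry lookup changes."""
    entry = NODE_TYPES[language]
    function_types = set(entry["function"])
    targets = function_types | set(entry["class"])
    items = []
    for node in walk(root):
        if node.type in targets:
            name_node = node.fields.get(entry["name_field"])
            items.append({
                "type": "function" if node.type in function_types else "class",
                "name": name_node.text if name_node else "<anonymous>",
            })
    return items


python_tree = StubNode("module", children=[
    StubNode("function_definition", fields={"name": StubNode("identifier", "process_data")}),
])
rust_tree = StubNode("source_file", children=[
    StubNode("struct_item", fields={"name": StubNode("type_identifier", "Config")}),
    StubNode("function_item", fields={"name": StubNode("identifier", "load")}),
])

python_items = collect_items(python_tree, "python")
rust_items = collect_items(rust_tree, "rust")
print(python_items)
print(rust_items)
```

Nothing in `collect_items` mentions Python or Rust by name, which is the whole point of moving the differences into data.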
THE EDGE CASES (where it gets interesting)
Three Edge Cases That Refused to Play Nice
The registry handles most of the 14 languages cleanly. Three edge cases needed special treatment, and each one taught me something about how fundamentally different language grammars are.
Edge Case 1 (C/C++): The Name Is Buried
In Python, the function name sits directly on the function node:
function_definition
├── name: "process_data" ← right here
└── parameters: (...)
In C, tree-sitter nests it one level deeper inside a “function_declarator”:
function_definition
├── type: "int"
└── function_declarator ← name is inside HERE
    ├── declarator: "main"
    └── parameters: (...)
My “extract_name()” function tries the standard approach first, then falls back to digging into the declarator:
def extract_name(node, name_field: str) -> str:
    # Standard: name is directly on the node
    name_node = node.child_by_field_name(name_field)
    if name_node:
        return name_node.text.decode("utf-8")
    # Fallback for C/C++: dig into function_declarator
    for child in node.children:
        if child.type == "function_declarator":
            inner = child.child_by_field_name("declarator")
            if inner:
                return inner.text.decode("utf-8")
    return "<anonymous>"
Not a separate parser — just a fallback path in the same function.
Edge Case 2 (Dart): Double Nesting
Dart’s “method_signature” wraps a “function_signature”, which wraps the actual name. Worse: if you’re not careful, the tree walk yields both the outer and inner node, so you get every method twice.
method_signature
└── function_signature ← ALSO matches "function" node types
    └── name: "fetchUser"
Two problems, two fixes.
For the name, I added another fallback to “extract_name()”:
# Fallback for Dart: name is inside function_signature
for child in node.children:
    if child.type in ("function_signature", "getter_signature"):
        inner_name = child.child_by_field_name(name_field)
        if inner_name:
            return inner_name.text.decode("utf-8")
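With the standard lookup, the C lookup, and the Dart lookup side by side, the chain can also be written as a list of independent strategies. The sketch below is a variation on the single-function version, not the actual implementation: `StubNode` is a hypothetical stand-in for tree-sitter nodes, and the strategy function names are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StubNode:
    """Hypothetical stand-in for a tree-sitter node."""
    type: str
    text: str = ""
    fields: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def direct_name(node: StubNode) -> Optional[str]:
    """Standard case: the name is a field on the node itself."""
    found = node.fields.get("name")
    return found.text if found else None


def c_declarator_name(node: StubNode) -> Optional[str]:
    """C/C++ case: the name sits inside a function_declarator child."""
    for child in node.children:
        if child.type == "function_declarator":
            inner = child.fields.get("declarator")
            if inner:
                return inner.text
    return None


def dart_signature_name(node: StubNode) -> Optional[str]:
    """Dart case: the name sits inside a nested signature node."""
    for child in node.children:
        if child.type in ("function_signature", "getter_signature"):
            inner = child.fields.get("name")
            if inner:
                return inner.text
    return None


# Order matters: the cheapest and most common lookup goes first
NAME_STRATEGIES = [direct_name, c_declarator_name, dart_signature_name]


def extract_name(node: StubNode) -> str:
    for strategy in NAME_STRATEGIES:
        name = strategy(node)
        if name:
            return name
    return "<anonymous>"


py_fn = StubNode("function_definition", fields={"name": StubNode("identifier", "process_data")})
c_fn = StubNode("function_definition", children=[
    StubNode("function_declarator", fields={"declarator": StubNode("identifier", "main")}),
])
dart_fn = StubNode("method_signature", children=[
    StubNode("function_signature", fields={"name": StubNode("identifier", "fetchUser")}),
])
unnamed = StubNode("lambda")

names = [extract_name(n) for n in (py_fn, c_fn, dart_fn, unnamed)]
print(names)
```

Each strategy can be tested on its own, and a new grammar quirk becomes one more function appended to the list.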
For the duplicate problem, I added “skip_children_types” to the tree walker:
def walk_tree(node, target_types, skip_children_types=None):
    if node.type in target_types:
        yield node
        # Don't recurse into children that would match again
        # (the None check keeps the default argument from raising a TypeError)
        if skip_children_types and node.type in skip_children_types:
            return
    for child in node.children:
        yield from walk_tree(child, target_types, skip_children_types)
When parsing Dart, “function_types” is passed as “skip_children_types”. If we match a “method_signature”, we yield it — but don’t recurse into its “function_signature” child. No duplicates.
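The effect of “skip_children_types” is easiest to see on a tiny tree. In this sketch `StubNode` is a hypothetical stand-in for tree-sitter nodes, and the tree shape is a simplified assumption about what a Dart file might produce.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    """Hypothetical stand-in for a tree-sitter node: just a type and children."""
    type: str
    children: list = field(default_factory=list)


def walk_tree(node, target_types, skip_children_types=None):
    skip = skip_children_types or frozenset()
    if node.type in target_types:
        yield node
        # A matched node of a "skip" type is yielded once; its subtree is not searched
        if node.type in skip:
            return
    for child in node.children:
        yield from walk_tree(child, target_types, skip)


# One method (signature wrapping a signature) plus one top-level function
root = StubNode("program", [
    StubNode("method_signature", [StubNode("function_signature")]),
    StubNode("function_signature"),
])
targets = {"method_signature", "function_signature"}

without_skip = [n.type for n in walk_tree(root, targets)]
with_skip = [n.type for n in walk_tree(root, targets, skip_children_types=targets)]

print(without_skip)
print(with_skip)
```

Without the skip set, the method shows up twice (once as the outer signature and once as the inner one). With it, the method is reported once and the top-level function is still found.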
Edge Case 3 (TypeScript and PHP): Grammar Naming
Most tree-sitter grammar packages expose a “language()” function. TypeScript and PHP don’t.
TypeScript’s package has two grammars (TypeScript + TSX), so it exposes “language_typescript()” and “language_tsx()”. PHP exposes “language_php()”.
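import importlib
from tree_sitter import Language

def get_language(language: str):
    # GRAMMAR_MAP and _language_cache are defined in the lazy-loading section below
    if language in _language_cache:
        return _language_cache[language]
    grammar_module = importlib.import_module(GRAMMAR_MAP[language])
    if language == "typescript":
        lang = Language(grammar_module.language_typescript())
    elif language == "php":
        lang = Language(grammar_module.language_php())
    else:
        lang = Language(grammar_module.language())
    _language_cache[language] = lang
    return lang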
Without the cache, you’d reload and re-initialize the grammar on every file. With 500 TypeScript files, that matters.
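If more grammars turn out to use non-standard entry points, the if/elif chain could become a lookup table. The sketch below is one possible shape, not how Codewalk does it: `ENTRY_POINTS`, `load_grammar`, and the `wrap` parameter are hypothetical, and fake modules stand in for real grammar packages so the example runs with nothing installed. In real use, `wrap` would be `tree_sitter.Language`.

```python
import importlib
import sys
import types

# Languages whose grammar module does not expose a plain language() function
ENTRY_POINTS = {"typescript": "language_typescript", "php": "language_php"}


def load_grammar(language: str, module_name: str, wrap=lambda raw: raw):
    """Import the grammar module and call the right entry point for this language."""
    module = importlib.import_module(module_name)
    entry_point = getattr(module, ENTRY_POINTS.get(language, "language"))
    return wrap(entry_point())


def register_fake_module(name: str, **functions) -> None:
    """Demo helper: put a throwaway module in sys.modules so import_module finds it."""
    module = types.ModuleType(name)
    for attr, fn in functions.items():
        setattr(module, attr, fn)
    sys.modules[name] = module


register_fake_module(
    "fake_typescript_grammar",
    language_typescript=lambda: "ts-grammar",
    language_tsx=lambda: "tsx-grammar",
)
register_fake_module("fake_python_grammar", language=lambda: "py-grammar")

ts_result = load_grammar("typescript", "fake_typescript_grammar")
py_result = load_grammar("python", "fake_python_grammar")
print(ts_result, py_result)
```

The trade-off is the usual one: a table is shorter and easier to extend, while the explicit branches are easier to step through in a debugger.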
LAZY GRAMMAR LOADING
Lazy Loading 14 Grammars Without Importing 14 Packages
A Rust codebase doesn’t need the PHP grammar loaded. A Python project doesn’t need the Swift grammar. Importing all 14 at startup would be wasteful — and would crash if a grammar package isn’t installed.
Each grammar is a separate pip package (“tree-sitter-python”, “tree-sitter-rust”, etc.). I use “importlib” to load them on demand:
import importlib
from tree_sitter import Language

GRAMMAR_MAP = {
    "python": "tree_sitter_python",
    "javascript": "tree_sitter_javascript",
    "typescript": "tree_sitter_typescript",
    "dart": "tree_sitter_dart",
    # ... 10 more
}

_language_cache = {}

def get_language(language: str):
    if language in _language_cache:
        return _language_cache[language]
    module_name = GRAMMAR_MAP.get(language)
    if not module_name:
        return None
    try:
        grammar_module = importlib.import_module(module_name)
        # TypeScript/PHP entry points from Edge Case 3 are left out here for brevity
        lang = Language(grammar_module.language())
        _language_cache[language] = lang
        return lang
    except (ImportError, AttributeError):
        return None  # grammar not installed: skip silently
If “tree-sitter-dart” isn’t installed, Dart files are silently skipped. No crash. No config needed. Install the grammar, restart, done.
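Silent skipping is convenient, but it can leave a user wondering why their Dart files never show up. One option, sketched here as an assumption rather than something Codewalk is known to do, is to list the missing grammars up front using `importlib.util.find_spec`, which looks a module up without importing it. The function name `missing_grammars` and the demo map are hypothetical.

```python
import importlib.util


def missing_grammars(grammar_map: dict[str, str]) -> list[str]:
    """Return the languages whose grammar module cannot be found on this machine."""
    return sorted(
        language
        for language, module_name in grammar_map.items()
        if importlib.util.find_spec(module_name) is None
    )


# Demo map: "json" is a stdlib module standing in for an installed grammar;
# the second module name is deliberately made up so it is reported as missing.
demo_map = {
    "installed_stand_in": "json",
    "made_up": "tree_sitter_language_that_does_not_exist",
}

missing = missing_grammars(demo_map)
print(missing)
```

A one-line warning built from that list keeps the no-crash behavior while telling the user which package to install.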
THE RESULTS
What This Gets You
The entire parser is ~290 lines — supporting 14 languages.
Languages supported: 14
Core functions: 4 (“get_language”, “extract_name”, “walk_tree”, “parse_file”)
Lines of code: ~290
Edge case handlers: 3 (C name nesting, Dart dedup, TS/PHP grammar naming)
Adding Kotlin was literally this:
    "kotlin": {
        "function": ["function_declaration"],
        "class": ["class_declaration", "object_declaration"],
        "name_field": "name",
        "params_field": "function_value_parameters",
    },
Four lines. No new function. No new file.
TAKEAWAYS
- When you’re writing the same function with different constants, extract the constants into data. The NODE_TYPES registry turned 14 functions into 1.
- Tree-sitter grammars are not uniform. C nests names deeper, Dart double-wraps signatures, TypeScript splits its grammar export. Budget time for edge cases.
- Lazy loading with “importlib” lets you support many languages without requiring all of them. Users install only the grammars they need.
- “skip_children_types” is a pattern worth stealing. Anytime you’re walking a tree and parent/child nodes overlap in type, you need a way to prevent double-yielding.
- The fallback chain pattern (“try standard → try C-style → try Dart-style → return anonymous”) is more maintainable than per-language if/else blocks. Each fallback is independent.
CLOSING
I started with 14 languages and the assumption that each one needed its own parser. The reality? They all have functions and classes; they just call them different things. A dict mapping and three fallback paths handle all 14.
Next up: these parsed functions mean nothing without knowing which files import which. In Post #2, I build an import resolver that handles Python’s dot-separated paths, Rust’s “crate::” syntax, Dart’s “package:” URIs, and 11 more — each with completely different rules.
That one was harder.
📦 Full source: code_parser.py on GitHub
👉 Follow for the next deep dive, or check the series links at the top.