DEV Community

Mohamed Shams El-Deen
Mohamed Shams El-Deen

Posted on

Regex isn't the best solution in all cases: Building a Recursive TypeScript Parser for nodejs/doc-kit

Introduction
A crucial part of modern documentation is cross-linking. When a developer reads an API doc, clicking a type should navigate them directly to its definition.

During my GSoC 2026 journey redesigning Webpack’s documentation, I realized the issue wasn't just in Webpack's codebase. The root of the problem lived upstream in nodejs/doc-kit, the engine responsible for parsing TypeScript annotations and generating Markdown links.

It started with a simple bug: the parser didn't understand basic generics like Promise<string>. But what began as a quick Regex fix evolved into building a Recursive Descent Parser capable of understanding deeply nested generic trees, operator precedence, and function signatures. Here is the technical breakdown of how I built it.

Phase 1: The Regex Trap

My initial approach PR #668 was naive. I thought, "It's just < >, I can use a Regular Expression 👀" I wrote a simple regex /^([^<]+)<([^>]+)>$/ to capture the base type and the inner type.

It worked perfectly for Promise<string>. But during the code review, a maintainer (Aviv Keller) asked a simple yet destructive question: "What about Promise<string | boolean>?"

Standard string splitting would blindly break the generic in half, resulting in malformed syntax. To patch this, I wrote a temporary utility called splitByOuterUnion that tracked the depth of the brackets, ensuring it only split the | symbol if it wasn't trapped inside < >:

const splitByOuterUnion = str => {
  const result = [];
  let current = '';
  let depth = 0;

  for (const char of str) {
    if (char === '<') {
      depth++;
    } else if (char === '>') {
      depth--;
    } else if (char === '|' && depth === 0) {
      result.push(current);
      current = '';
      continue;
    }
    current += char;
  }

  result.push(current);
  return result;
};
Enter fullscreen mode Exit fullscreen mode

It was a smart patch, but deep down, I knew it was just a band-aid.

Phase 2: The Breaking Point

A few weeks later, we needed to support highly complex types spanning multiple parameters, nested generics, and inline functions, such as:
(str: string[]) => Promise<Map<string, number & string>, Map<string | number>>

Regex was officially useless. You cannot parse infinitely nested mathematical structures with Regular Expressions. We needed a custom parser PR #763.

I deleted the regex approach and designed a top-down Recursive Descent Parser. Instead of trying to match the whole string at once, the parser breaks it down layer by layer, from the weakest operator to the strongest.

Here are the 3 pillars of this architecture:

Pillar 1: Depth Tracking (walkAtDepthZero)

The foundation of the parser is an iterator that walks through the string character by character, keeping track of bracket depth () {} <> []. It takes a callback onToken that only executes when depth === 0 (meaning we are at the root scope, not trapped inside a nested generic or function parameter).

const TYPE_OPENERS = new Set(['<', '(', '{', '[']);
const TYPE_CLOSERS = new Set(['>', ')', '}', ']']);

const walkAtDepthZero = (str, onToken) => {
  let depth = 0;
  for (let i = 0; i < str.length; i++) {
    const char = str[i];
    if (TYPE_OPENERS.has(char)) {
      depth++;
    } else if (TYPE_CLOSERS.has(char) && !isArrowTail(str, i)) {
      depth--;
    }

    if (depth === 0) {
      const skip = onToken(i, char);
      if (skip === true) {
        return;
      }
      if (typeof skip === 'number') {
        i += skip;
      }
    }
  }
};

Enter fullscreen mode Exit fullscreen mode

Pillar 2: Operator Precedence

In TypeScript, the arrow function => wraps around unions |, which wrap around intersections &. Our parser needs to find the weakest link to split the string correctly.
Using walkAtDepthZero, I built findTopLevelOperator to scan the root scope and locate the first occurrence of these operators, respecting their precedence:

const findTopLevelOperator = str => {
  let arrowIdx = -1;
  let unionIdx = -1;
  let intersectIdx = -1;

  walkAtDepthZero(str, (i, char) => {
    if (char === '=' && str[i + 1] === '>') {
      if (arrowIdx === -1) {
        arrowIdx = i;
      }
      return 1; // skip '>'
    }
    if (char === '|' && unionIdx === -1) {
      unionIdx = i;
    } else if (char === '&' && intersectIdx === -1) {
      intersectIdx = i;
    }
  });

  if (arrowIdx !== -1) {
    return { op: '=>', index: arrowIdx, width: 2 };
  }

  if (unionIdx !== -1) {
    return { op: '|', index: unionIdx, width: 1 };
  }

  if (intersectIdx !== -1) {
    return { op: '&', index: intersectIdx, width: 1 };
  }

  return null;
};
Enter fullscreen mode Exit fullscreen mode

Pillar 3: The Recursive Engine (parseType)

This is where the magic happens. parseType acts as the orchestrator. It uses findTopLevelOperator to split the string into left and right chunks. Then, it calls itself recursively on those smaller chunks until it strips away all arrays, generics, and parentheses, finally reaching the base types to generate the Markdown URLs.

export const parseType = (typeString, transformType) => {
  const trimmed = stripOuterParentheses(typeString);
  if (!trimmed) {
    return null;
  }

  const op = findTopLevelOperator(trimmed);
  if (op) {
    if (op.op === '=>') {
      const left = trimmed.slice(0, op.index).trim();
      const right = trimmed.slice(op.index + op.width).trim();
      const sig = parseFunctionSignature(left, transformType);
      return `${sig} =&gt; ${resolveOr(right, transformType)}`;
    }

    // Union / intersection
    const parts = splitByOuterSeparator(trimmed, op.op);
    const joiner = op.op === '|' ? ' | ' : ' & ';
    return parts.map(p => resolveOr(p, transformType)).join(joiner);
  }

  // Strip a trailing `[]` for now; reapply on the way out.
  const isArray = trimmed.endsWith('[]');
  const core = isArray ? trimmed.slice(0, -2).trim() : trimmed;
  const arrayTail = isArray ? '[]' : '';

  // Generic: `Base<...>`.
  const ltIdx = core.indexOf('<');
  if (ltIdx !== -1 && core.endsWith('>')) {
    const baseType = core.slice(0, ltIdx).trim();
    const innerType = core.slice(ltIdx + 1, -1).trim();
    const inner = splitByOuterSeparator(innerType, ',')
      .map(arg => resolveOr(arg, transformType))
      .join(', ');
    return `${formatType(baseType, transformType)}&lt;${inner}&gt;${arrayTail}`;
  }

  // Plain base type.
  if (!core.length) {
    return null;
  }

  const url = transformType(core);
  if (!url) {
    return null;
  }

  return `[\`<${core}>\`](${url})${arrayTail}`;
};

Enter fullscreen mode Exit fullscreen mode

What I Learned

  • Cross-Repo Impact: Fixing an issue in Webpack is great, but fixing the underlying tool (nodejs/doc-kit) impacts the entire ecosystem.
  • Regex is not a Parser: Regex is for pattern matching, not syntax analysis.
  • Think Like a Compiler: Understanding Abstract Syntax Trees (AST) and operator precedence isn't just academic theory; it's practically required to build reliable developer tooling.

Conclusion

We successfully replaced the regex fallback with the recursive parser. Now, Webpack's documentation (and any other project using nodejs/doc-kit) can accurately render and link extreme edge cases of TypeScript signatures.

Writing a parser from scratch taught me that sometimes the "quick fix" is just a delay. Taking the time to understand the root of the problem and engineering a mathematically sound solution is always worth the effort 🤍

Top comments (4)

Collapse
 
nazar_boyko profile image
Nazar Boyko

Genuinely nice write-up. One thing I got curious about in walkAtDepthZero: since every opener bumps the same depth counter and every closer drops it, does a mismatched pair like Foo<Bar] still land at depth zero and get treated as valid? I'm guessing malformed input never reaches the parser because the annotations come from a typed source, so it was never worth handling, but I wondered if you had to guard against it or just leaned on the input always being well-formed.

Collapse
 
moshams272 profile image
Mohamed Shams El-Deen

Exactly, I actually considered using a Stack during the initial design. However, I stepped back and looked at our pipeline: we don't parse raw, human-written source code. We actually traverse the types.d.ts (TypeScript declaration) files.

Since these .d.ts files are the final, generated output of the TypeScript compiler, they act as a pre-validated AST. If a developer accidentally typed Foo<Bar], the TS compiler would throw a syntax error and fail to generate the declaration file long before our parser even runs.

Because of that the input is perfectly well-formed, using a Stack would just mean unnecessary memory allocation. So, relying on a single integer became a conscious optimization to keep the parser as lightweight and fast as possible 😊

Collapse
 
frank_signorini profile image
Frank

I've found regex to be limiting for complex parsing tasks, how did you handle edge cases with your recursive parser? I'd love to swap ideas on this.

Collapse
 
moshams272 profile image
Mohamed Shams El-Deen

Think of it as a tree. To break down this tree, we must first ensure we are scanning at the root level (depth 0), strip any redundant outer parentheses, and then split the string into branches based on operator precedence. Let's take an example:

Input: (str: string[]) => Promise<string | number>

Enter fullscreen mode Exit fullscreen mode

Step 1: Find the Root Split (Weakest Operator at Depth 0)
The parser walks through the string. It finds => at depth 0. This is our weakest operator, so we cut the tree into two main branches (Left and Right). Although It sees the | inside the < >, it ignores it because the depth is 1.

  • Left Branch (Parameters): (str: string[])
  • Right Branch (Return Type): Promise<string | number>

Step 2: Process the Left Branch (Parameters)
We pass (str: string[]) back into the parser recursively:

  • It identifies it as a function signature.
  • It extracts the argument: str: string[].
  • It splits by : to separate the name from the type.
  • It processes string[] (strips the [], converts string to a link, and reattaches []).
  • Result: (str: [<string>](link)[])

Step 3: Process the Right Branch (Return Type)
We pass Promise<string | number> back into the parser recursively:

  • At depth 0, there are no operators like =>, |, or &.
  • It identifies a Generic format: Base<Inner>.
  • It splits this into the base type (Promise) and the inner arguments (string | number).

Step 4: Process the Inner Generic (Recursion again!)
We pass string | number back into the parser recursively:

  • Now, at depth 0, the parser finds the | operator!
  • It splits the string into string and number.
  • Both are base types, so they are instantly converted into links: [<string>](link) and [<number>](link).
  • It joins them back with the | operator.
  • Result: [<string>](link) | [<number>](link)

Step 5: Final Reassembly
The parser climbs back up the tree, putting the pieces together:

  1. Right branch becomes: [<Promise>](link)<[<string>](link) | [<number>](link)>
  2. Full string becomes: Left => Right

Final Output:
(str: [<string>](link)[]) => [<Promise>](link)<[<string>](link) | [<number>](link)>

Just think of it like tree and think of root depth and order of precedence 🙃

What's your idea 👀