DEV Community

Calin Baenen


How could I add regex (regular expression) support to ParseJS using my current method of finding tokens?

So, ParseJS is a tokenization library I made for JavaScript.
Its core feature is to sort a string into an array of characters (strings whose length is exactly one) and symbols, where each symbol stands in for a token. (Tokens are symbols so you can easily tell a token from a character.)

Anyways. Where am I going with this?
Well, ParseJS is good, but it's not great.

You can statically parse tokens, like so:

// Parameters: (str:string, toks:string[])
parse_string("test12 test1 test2 test", [
  "test",
  "test1",
  "test12",
  "test2"
]);

and it will reliably produce:

[
  Symbol.for("test12"),
  ' ',
  Symbol.for("test1"),
  ' ',
  Symbol.for("test2"),
  ' ',
  Symbol.for("test")
]

But there is no way to create abstract groups of tokens (e.g. variable names, which can be practically anything, so the tokenizer can't list every one of them in advance).


  • What I have:
parse_string("class Test: end", [
  "class",
  ':',
  "end"
]);
// -> [Sym(class), ' ', 'T', 'e', ..., Sym(:), ...]
  • What I want:
parse_string("class Test: end", [
  "class",
  ':',
  "end",
  /[^0-9\W]\w+/ // 'g' flag added automatically.
]);
// -> [Sym(class), ' ', Sym(Test), Sym(:), ' ', Sym(end)]

The goal:

  • Add regex support to allow abstract token groups to exist.

How I find tokens:

  • Loop through each string in toks and collect the first character of each string in epl.
  • Loop through each character of str as c, and if c is in epl, slice the next few characters ahead to see if a valid keyword exists.

The challenge(s):

  • Unlike a string, a regex can match text of varying length, so the length of each match would need to be computed.
  • I detect a possible token by checking whether the current character is the first character of some keyword. I can't do that with a regex, since there is no subscript operator or other way to get the character (or set of possible characters) a regex starts with.
  • I choose how much of the string to slice and test based on the lengths of the keywords. Since I can't get the length(s) a regex could match, I can't compute how big a substring to slice.
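One possible way around all three points, offered as a sketch and not as how ParseJS works: JavaScript's sticky flag (`y`) makes a regex match only at its `lastIndex`, so the regex can be asked "do you match right here?" without knowing its first character or its length up front. The length then comes from the match itself (`m[0].length`). The name `parseWithRegexSketch` is hypothetical, the sketch swaps the 'g' flag for 'y', and letting the longest match win (with string keywords winning ties) is my own assumption.

```javascript
// Hypothetical sketch: string keywords plus sticky regexes, longest match wins.
function parseWithRegexSketch(str, toks) {
  const literals = toks.filter((t) => typeof t === "string" && t.length > 0);
  // Rebuild each regex with the sticky flag so exec() only matches at lastIndex.
  const patterns = toks
    .filter((t) => t instanceof RegExp)
    .map((re) => new RegExp(re.source, re.flags.replace(/[gy]/g, "") + "y"));

  const out = [];
  let i = 0;
  while (i < str.length) {
    let best = ""; // longest lexeme found at position i so far
    for (const tok of literals) {
      if (tok.length > best.length && str.startsWith(tok, i)) best = tok;
    }
    for (const re of patterns) {
      re.lastIndex = i; // anchor the attempt at the current position
      const m = re.exec(str);
      // Strictly longer only, so a string keyword wins a tie with a regex.
      if (m !== null && m[0].length > best.length) best = m[0];
    }
    if (best.length > 0) {
      out.push(Symbol.for(best));
      i += best.length;
    } else {
      out.push(str[i]);
      i += 1;
    }
  }
  return out;
}

// Usage, assuming the identifier pattern /[^0-9\W]\w+/:
const result = parseWithRegexSketch("class Test: end", ["class", ":", "end", /[^0-9\W]\w+/]);
// -> [Sym(class), ' ', Sym(Test), Sym(:), ' ', Sym(end)]
```

Because the longest match wins, an input like "classy" comes out as one Sym(classy) and not as Sym(class) followed by 'y'. The trade-off is that every regex is tried at every position, which is slower than the first-character check used for plain strings.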
