So, ParseJS is a tokenization library I made for JavaScript.
Its core feature is to sort a string into an array of characters (strings whose length is strictly one) and symbols, where each symbol is a stand-in for a token. (Tokens are symbols so you can easily tell a token apart from a character.)
Anyways. Where am I going with this?
Well, ParseJS is good, but it's not great.
You can statically parse tokens, like so:
// Parameters: (str:string, toks:string[])
parse_string("test12 test1 test2 test", [
"test",
"test1",
"test12",
"test2"
]);
and it will reliably produce:
[
Symbol.for("test12"),
' ',
Symbol.for("test1"),
' ',
Symbol.for("test2"),
' ',
Symbol.for("test")
]
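That behavior can be sketched with a minimal re-implementation. To be clear, this is my own guess at the internals, not ParseJS's actual code; in particular, the longest-match-first rule is an assumption inferred from the output above:

```javascript
// Hypothetical sketch of parse_string's observable behavior:
// longest keyword wins, everything else falls through as single characters.
function parse_string(str, toks) {
  // Sort longest-first so "test12" beats "test1" beats "test".
  const sorted = [...toks].sort((a, b) => b.length - a.length);
  const out = [];
  let i = 0;
  while (i < str.length) {
    const match = sorted.find((t) => str.startsWith(t, i));
    if (match) {
      out.push(Symbol.for(match)); // token -> registered symbol
      i += match.length;
    } else {
      out.push(str[i]); // non-token -> single character
      i += 1;
    }
  }
  return out;
}

console.log(parse_string("test12 test1 test2 test", [
  "test",
  "test1",
  "test12",
  "test2"
]));
```

Because `Symbol.for` uses the global symbol registry, the same token string always maps to the same symbol, which is what makes the token/character distinction cheap to check later.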
BUT: there is no way of creating abstract groups of tokens (e.g. variable names can be practically anything, but the language doesn't enumerate them for you).
- What I have:
parse_string("class Test: end", [
"class",
':',
"end"
]);
// -> [Sym(class), ' ', 'T', 'e', ..., Sym(:), ...]
- What I want:
parse_string("class Test: end", [
"class",
':',
"end",
/[^0-9\W]\w+/ // 'g' flag added automatically.
]);
// -> [Sym(class), ' ', Sym(Test), Sym(:), ' ', Sym(end)]
The goal:
- Add regex support to allow abstract token groups to exist.
How I find tokens:
- Loop through each string in `toks` and collect the first character of each string in `epl`.
- Loop through each character of `str` as `c`, and if `c` is in `epl`, slice the next few characters ahead to see if a valid keyword exists.
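The two steps above can be sketched like this (the names `epl` and `c` mirror the write-up; `find_token` itself is a hypothetical helper, not ParseJS's real API):

```javascript
// Sketch of the lookup described above. `epl` holds each token's first
// character; `c` is the current character of `str` at index `i`.
function find_token(str, i, toks, epl) {
  const c = str[i];
  if (!epl.has(c)) return null; // fast reject: no keyword starts with c
  // Slice ahead by each keyword's length and compare, keeping the longest hit.
  let best = null;
  for (const t of toks) {
    if (str.slice(i, i + t.length) === t && (!best || t.length > best.length)) {
      best = t;
    }
  }
  return best;
}

const toks = ["class", ":", "end"];
const epl = new Set(toks.map((t) => t[0])); // Set { 'c', ':', 'e' }
console.log(find_token("class Test: end", 0, toks, epl)); // "class"
```

Note how the slice length comes straight from each keyword's `.length` — that is exactly the property a regex doesn't have.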
The challenge(s):
- Unlike strings, the length that a regex represents can be variable and would need to be computed.
- The way I check for tokens is by seeing if the first character of a keyword exists at the current position. But I can't exactly do that with a regex, since there's no subscript operator or other way to get the character (or set of possible characters) a regex could start with.
- I slice the substring to test based on the length of the keywords that exist. But, since I can't get the length(s) that a regex could be, I can't compute how big of a substring I need to slice to test.
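One possible way around the variable-length problem (a sketch of an idea, not something ParseJS does today): JavaScript's sticky `y` flag anchors a regex match at a given index, so the engine computes the match and its length for you, and no pre-sliced substring is needed at all:

```javascript
// Try to match `re` exactly at index `i` of `str` using the sticky flag.
// Returns the matched text, or null if nothing matches at that position.
function match_at(str, i, re) {
  const sticky = new RegExp(re.source, "y"); // re-compile with the 'y' flag
  sticky.lastIndex = i; // sticky regexes start (and must match) at lastIndex
  const m = sticky.exec(str);
  return m ? m[0] : null;
}

const ident = /[^0-9\W]\w+/;
console.log(match_at("class Test: end", 6, ident)); // "Test"
```

With this approach the first-character `epl` optimization could stay for plain-string tokens, while regex tokens fall back to a sticky `exec` per candidate position.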