Discussion on: how to make a parser?

View post

This is a big topic and the approach you need to use depends on your end goal. I'll try to explain it shortly and hopefully you'll get some keywords that you can google.

The old traditional way is tokenizing, then lexing, then parsing. In each step, you basically annotate your input to be more structured and then use that annotation in the next. The output of tokenizing is a token stream, which typically means that you split up the incoming characters into bigger, meaningful items. In your example above, this would mean you get number, operator, number, operator, number.

Then, lexing will take that token stream and use this information and the rules of your language to yield an abstract syntax tree. At this point, the tree can be traversed and the expression can be understood by a program (i.e. parsed).

Tokenizers and lexers can be auto-generated from formal grammar definitions, such as LR(1). yacc/bison and flex are tools that can be used for this. However, the generated code might not be very maintainable if your language is too complex. There is a certain middle ground of complexity where it's worth generating this stuff and where you don't get a monster of generated code that you can't touch.

Another, more modern approach is using something like a PEG grammar / PEG parser generator. PEG grammars bunch the tokenizing and lexing steps together and generally give you more fidelity in expressing the rules of your toy language. You didn't mention the language you were using to implement your parser, but I had some success with parsimonious using Python: github.com/erikrose/parsimonious (you can also search "xyz PEG parser" where xyz is the language of your choice).

Finally, not every language can be expressed with a formal grammar, but if you are creating something from scratch, it's a really good idea to at least start with that. Otherwise, you will have to find context manually in the lower levels (e.g. in the token stream or the text itself), or even from other sources (like the output of parsing other files).

Dinu Blanovschi • Dec 23 '20

tokenizing, then lexing, then parsing

Then, lexing will take that token stream and use this information and the rules of your language to yield an abstract syntax tree.

Note that tokenizing = lexing = getting a token stream from the input text, while parsing refers to building up the AST.