How to Build a RegEx Engine in Python (Part 1: The Grammar)

#python #programming #regex #coding

In this series of articles, we will follow the steps to build a RegEx Engine in Python. I chose Python over the many alternatives for a few reasons:

It is well integrated with the Linux command line
Because yes.

In a way, I find Python to be a “less verbose version of Java” and a “less confusing version of JavaScript”, which were the other two alternatives I considered.

Please let me explain before get mad at me.

Before starting this project I had no idea how things should be done, how to code the various modules, and so on.

I had to figure everything out by myself.

And, when I don’t know how to do things exactly, I find JS a bit too confusing, it’s just a feeling, not a universal truth, and I find that with Java I can organize my ideas better.

But, since I don’t like how verbose and limiting Java is in some way, but wanted classes, and I love having functions as first-class citizens, Python seemed like the perfect blend to me.

Anyway, it’s now time to start with the project itself.

What We Need

To match a string with a regular expression of our will we need to complete a few steps:

understand the regular expression
create an internal representation of it (so that we can see if a passed string match with it)
match the string with the regex, using the internal representation we built of it.

To complete the first two steps we need two components:

a Lexer (or scanner) to parse the regex in tokens (“words”, generic “pieces of information”). As an example in the sentence “I am beautiful, very beautiful” each word (“I”, “am”, …) is a token of type “word” and the comma is a token of type “punctuation”
a Parser which read all the tokens in sequence and “understands” them, building while doing so an internal representation of the regex (which will be a tree, called Abstract Syntax Tree, AST for the friends).

To accomplish the third step we need to build a component able to take as input the AST built by the Parser and a string, and visiting the tree to match each leaf of the tree with some part of the string (and something more, we will discuss later).

This component is named:

Engine.

The Roadmap

So, now that we know the modules we will have to code, let’s build a roadmap:

define the RegEx grammar we want to recognize
build the lexer
build the parser
build the engine
enjoy.

The Grammar

It’s now time to discuss the grammar we want to recognize.

I assume you know what a formal grammar is and that you understand the EBNF notation, if not, DON’T PANIC, google it.

The regex grammar we will recognize is the following:

Top-level production RE ::= RE_SEQ is indeed useless but is there because I used it in one of the early versions of the project and I was too lazy to remove it later. Anyway, it is completely harmless, I promise, and it will cost you just a couple of lines of code and only one more call on the stack during execution.