It is finally time to write about the engine of the regex engine we’re building.
As we already saw, the parser produces an AST as a result of the parsing of a regular expression.
Now, the engine has to check whether a given string matches or not a regex.
To do so, the engine uses the AST representation of the regex, which extracts the fundamental features/elements of the regular expression, and tries to “find” if the test string has them.
The Engine
The engine is again a Python class, having a fundamental method match, which is a closure containing the engine and all the needed auxiliary methods.
Match signature:
def match(self, re: str, string: str)
Match Semantic
The semantic we will give to the match method is the following:
- it returns on the first occurrence of the regex in the string
- if there is no match, it tries again shifting the input string one character to the right (that is, moving the string one character to the right) until it reaches the end of the string.
That is, if the input string is abac and the regex is c, on the first attempt abac won’t match c, so match tries again, but shifting abac one character to the right this time, so tries to match bac with c, resulting in another fail, so it tries again with ac and c (failing), and, lastly, it tries c with c, resulting in a match, thus returning true.
The
matchfunction so far.
How To Match
match_group is the most important method of the engine.
It works by iterating through an AST node’s children, and by calling itself recursively when a GroupNode is found or, with some differences, when an OrNode is found.
If an OrNode is found, it checks if the left children (the one in position 0) matches, if it doesn’t, it tries the right one.
When a leaf node is found (ElementNode, …) it tries to match it.
After every match, the str_i (the index of the string) is increased of the matched characters.
The string index is set to position 0 (or the first position to consider more in general) at the beginning, and is kind of a “global variable”, but local to the match function.
The mechanism of the recursive calls makes it easy to manage the “point in the tree” and “point in the string” considered. It also allows you to think of each group as a separate entity, making it easier to follow mentally the code execution.
Obviously, this isn’t the whole engine, because as stated in the first article, we will need a backtracking system to assure to explore all the possible paths to match the string.
A little problem
In fact, the semantic of leaf nodes and groups is to match them to the most characters possible by default, and while this is convenient most of the time, there are some edge cases in which this won’t lead to a correct answer.
Let’s examine this example:
If you have a regex a+ab and a test string aab:
- the resulting AST will consist of a
GroupNode, with three children, all leaves - a first
ato be matched [1, +∞[ times - a second
ato be matched exactly one time - a final
bto be matched exactly one time.
During the execution of our match function as presented, this is what will happen:
-
match_groupis called and start iterating on its children - the first
aelement is matched as much as possible, so two timesaa - then the engine tries to match the second
aelement, but it fails as now the only left character is ab - false would be returned, but we all know the correct answer is true.
This is because of the semantics we gave to our match function. What we would like the system to do is to “try another way”:
- since the first element could accept 1 or more occurrences of the letter
a, and it matched two of them, we would like to be able to tell it “hey, please try by matching it one time less” - in this case, the first
awould be matched one time, leaving enough room to fit the secondaelement, and then also thebelement, our third and last - this process worked, and so the answer would be true, as it should be.
This process is called backtracking and will be the topic of the next article, for now, let’s ignore the problem.
Conclusion
Overall, we covered pretty much everything we needed to cover to explain the behavior of the match method.
The skeleton of our code will look something like this
matchfunction’s structure.
In the next article, we will cover the interesting topic of backtracking and will complete our regex engine.


Top comments (1)
What regex engine does Python use? dua to protect husband from another woman