DEV Community

Lorenzo Felletti
Lorenzo Felletti

Posted on • Originally published at python.plainenglish.io

How to Build a RegEx Engine in Python (Part 6: the engine)

It is finally time to write about the engine of the regex engine we’re building.

As we already saw, the parser produces an AST as a result of the parsing of a regular expression.

Now, the engine has to check whether a given string matches or not a regex.

To do so, the engine uses the AST representation of the regex, which extracts the fundamental features/elements of the regular expression, and tries to “find” if the test string has them.

The Engine

The engine is again a Python class, having a fundamental method match, which is a closure containing the engine and all the needed auxiliary methods.

Match signature:

def match(self, re: str, string: str)
Enter fullscreen mode Exit fullscreen mode

Match Semantic

The semantic we will give to the match method is the following:

  • it returns on the first occurrence of the regex in the string
  • if there is no match, it tries again shifting the input string one character to the right (that is, moving the string one character to the right) until it reaches the end of the string.

That is, if the input string is abac and the regex is c, on the first attempt abac won’t match c, so match tries again, but shifting abac one character to the right this time, so tries to match bac with c, resulting in another fail, so it tries again with ac and c (failing), and, lastly, it tries c with c, resulting in a match, thus returning true.

The  raw `match` endraw  function so far.

The match function so far.

How To Match

match_group is the most important method of the engine.

It works by iterating through an AST node’s children, and by calling itself recursively when a GroupNode is found or, with some differences, when an OrNode is found.

If an OrNode is found, it checks if the left children (the one in position 0) matches, if it doesn’t, it tries the right one.

When a leaf node is found (ElementNode, …) it tries to match it.

After every match, the str_i (the index of the string) is increased of the matched characters.

The string index is set to position 0 (or the first position to consider more in general) at the beginning, and is kind of a “global variable”, but local to the match function.

The mechanism of the recursive calls makes it easy to manage the “point in the tree” and “point in the string” considered. It also allows you to think of each group as a separate entity, making it easier to follow mentally the code execution.

Obviously, this isn’t the whole engine, because as stated in the first article, we will need a backtracking system to assure to explore all the possible paths to match the string.

A little problem

In fact, the semantic of leaf nodes and groups is to match them to the most characters possible by default, and while this is convenient most of the time, there are some edge cases in which this won’t lead to a correct answer.

Let’s examine this example:

If you have a regex a+ab and a test string aab:

  • the resulting AST will consist of a GroupNode, with three children, all leaves
  • a first a to be matched [1, +∞[ times
  • a second a to be matched exactly one time
  • a final b to be matched exactly one time.

During the execution of our match function as presented, this is what will happen:

  • match_group is called and start iterating on its children
  • the first a element is matched as much as possible, so two times aa
  • then the engine tries to match the second a element, but it fails as now the only left character is a b
  • false would be returned, but we all know the correct answer is true.

This is because of the semantics we gave to our match function. What we would like the system to do is to “try another way”:

  • since the first element could accept 1 or more occurrences of the letter a, and it matched two of them, we would like to be able to tell it “hey, please try by matching it one time less”
  • in this case, the first a would be matched one time, leaving enough room to fit the second a element, and then also the b element, our third and last
  • this process worked, and so the answer would be true, as it should be.

This process is called backtracking and will be the topic of the next article, for now, let’s ignore the problem.

Conclusion

Overall, we covered pretty much everything we needed to cover to explain the behavior of the match method.

The skeleton of our code will look something like this

 raw `match` endraw  function’s structure.

match function’s structure.

In the next article, we will cover the interesting topic of backtracking and will complete our regex engine.

Top comments (1)

Collapse
 
matlockcarol profile image
MatlockCarol

What regex engine does Python use? dua to protect husband from another woman