Hi everyone! I’m Daniil and I’m a Java developer at Just AI. In this article, I’m going to tell you how we faced a backtracking issue for regex and how we solved this problem using re2j.
But most importantly, I’ll cover both advantages and disadvantages of this library.
Here’s what you’ll learn:
- How we got a problem with backtracking in a regular expression.
- How we dealt with it using re2j.
- What not obvious issues can you face while using this library?
What is backtracking?
There’s been a lot of discussions around this problem. And for sure, I’m not the first or the last person to raise this question.
In a few words, this problem appears under specific circumstances and causes regex matching to run for a really long time.
Also, the real problem starts when you have no control over the regex your customers create and handling customer data is a very sensitive task .
You can checkout this article for more details and dive deeper into the subject.
How we addressed this issue
We’re developing JAICP - is a conversational AI platform that helps users create chatbots and often have to deal with customer information in conversations with bots.
Of course, customer data can be quite sensitive. But we really care about privacy and obfuscate confidential information.
It looks like this:
The key moment here is that customers can set their own regex to obfuscate data and create a string as a substitution.
Let’s catch problem first
Now that we know what our task is, let’s first take a look at a rather obvious solution with the standard Java regex.
We have quite a common example of obfuscation and it’s also a basic pattern for our case. It is an email obfuscation. Let's substitute them with xxxx@xxxx.xx for example. Here is an email regex pattern for substitution:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
Instead of this regex, customers can write whatever they want.
Let's see how long the following code will run with the standard java engine:
String inputText = TestUtils.getRandomLongWord("test@test.com");
Matcher matcher = Pattern.compile(TestUtils.emailRegex).matcher(inputText);
matcher.replaceAll("XXX@XXX.XX");
In this and the following examples, we’ll generate random text with target injections. In the example above, it’s an email address. It should look something like the following:
...UUID.randomUUID()+"test@test.com"+UUID.randomUUID()
We call randomUUID 500 times and inject target text every 10 calls.
I used an i7-10510U CPU 1.80GHz × 8 machine and jdk 1.8.
Execution time - 36 ms
It’s quite fast. But, let’s see what happens when the target text hasn’t been injected.
The text is the same, but now it looks like this:
UUID.random()+UUID.random()
String inputText = TestUtils.getRandomLongWord("");
Matcher matcher = Pattern.compile(TestUtils.emailRegex).matcher(inputText);
matcher.replaceAll("XXX");
Execution time - 8879 ms
🤔 We seem to have a backtracking problem now.
Solution
To fix it, we’ll use the Re2j library.
This library allows for running regexes in linear time. And it doesn’t use a standard NFA approach for a regex engine, but DFA instead.
There are articles like this and this one to help you find your way around it.
It’s important to note that this library was invented exactly for the situations when we don’t know the regex to execute. If you’re facing the same issue, you definitely should give it a go.
But is Re2j so good that it can fix the 9 seconds issue? Let’s see!
To check it, we should just substitute our Pattern
with com.google.re2j.Pattern
and that’s it.
com.google.re2j.Pattern.compile(emailRegex)
.matcher(TestUtils.getRandomLongWord(""))
.replaceAll("XXX")
The new execution time is 88 ms
🔥🔥🔥
Amazing! I’m sure you are with me in wanting to call it a silver bullet and start to use this library everywhere.
But first, let’s take a closer look at a few more things.
What about the performance?
To measure performance, we will use the JMH benchmark. It’s a popular open-source tool for such purposes.
The benchmark body looks the same as in the example above:
@Benchmark
@BenchmarkMode(Mode.AverageTime)
@Fork(value = 1)
@Warmup(iterations = 2)
@Measurement(iterations = 5)
public void standartRegexWithEmailDataBenchmark(MyState myState, Blackhole blackhole) {
myState.standartEmailRegex.matcher(myState.emailInputTextWords).replaceAll("XXX");
}
standartEmailRegex
is java.util.regex.Pattern.compile(TestUtils.emailRegex)
The average replaceAll
execution time for the pattern with email injection is:
- Default java pattern -
0.004 second
- Re2j pattern -
0.002 second
We can see little performance advantages of the re2j over the default pattern. But the main thing is the re2j result wasn’t worse then the default one. And that means we can use it even for call-intensive parts of our program.
What about the big dictionary?
The re2j performs well in practice. But it’s worth checking a case where we have a big dictionary of words to match with the target text.
Let’s take a look at an example with the US cities: Boston, Chicago, Washington...
It transfers to regex like Boston|Chicago|Washington...
Of course, we use different data at work. But as an example, I took a list of the US cities from this website. I got it using the following js code:
Array.from(document.getElementsByClassName("topic-content pt-sm-15")[0].getElementsByTagName("section")).slice(1).flatMap(x => Array.from(x.children[1].children)).map(x => x.outerText)
This resulted in a dictionary with1,950
items. It’s smaller than the one we have in production but it’s still quite big.
So, we have a default java.util.regex
Pattern:
java.util.regex.Pattern.compile(String.join("|", TestUtils.usCitiesAndTowns))
And com.google.re2j
Pattern:
com.google.re2j.Pattern.compile(String.join("|", TestUtils.usCitiesAndTowns));
We will use the JMH again.
An average replaceAll
execution time for the pattern with cities injection is:
- Default java pattern -
0.147 second
- Re2j pattern -
0.625 second
It means Re2j
is 4 times slower
then the default java regex😲
But what about the pattern size?
It’s not an obvious one but let’s check compiled size of our pattern.
Could something like Pattern.compile("\\d{2}")
be a problem?
To see, we’ll use ObjectSizeCalculator.getObjectSize
from nashorn
package. It’s not available in versions newer than JDK 8, but in version 8 it’s quite a good way to measure an object size in runtime.
So, we’ll measure the following code:
Pattern.compile(String.join("|", TestUtils.usCitiesAndTowns)
Where TestUtils.usCitiesAndTowns
is that big dictionary from the example above.
The final result should be something like: ObjectSizeCalculator.getObjectSize(Pattern.compile(String.join("|", TestUtils.usCitiesAndTowns)))
-
com.google.re2j.Pattern
- 1026872 bytes (~1mb) -
java.util.regex.Pattern
- 197608 bytes (~0.2mb)
We got a five-fold difference!!! 🦾
The re2j pattern spends 5x more memory then the default one.
Even our big but not gigantic dictionary uses up to 1 mb of memory. But the real-life dictionary for such cases could be much bigger. And it could be duplicated to several services and instances. So, our pattern could take a lot of memory.
Conclusion
We’ve discussed the re2j from different sides and mentioned some obscure but important disadvantages.
To sum it all up:
The main goal of this library is to solve long execution issues for cases with regex coming from the outside.
When using this library make sure to test it on your data and for your cases because:
- There could be opposite performance results.
- The Compiled Pattern could use more memory which may cause problems.
- Re2j doesn’t support all sets of regex abilities from the default java implementation.
P.S.
The entire piece of code is available on github.
Top comments (0)