Discussion on: Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

View post

Replies for: Building your own regex or string matching library is a rewarding experience. That being said, this is familiar terrain. We built Hyperscan ( git...

Cool I will check it out Geoff.

There are very good Regex libraries out there, no question. But what is the best library, If I don't want to match a regular expression, rather just words with word boundaries?

I didn't find anything solving the problem I was facing (I googled and tried libraries for weeks before giving up and writing my own).. I am very lazy that way, Avoiding work till it's not an absolute necessity :D.

My use case does not involve multiple regular expressions, I do NLP and I deal with a lot of strings. I found it as a good use-case, hence wrote the library :)

PS: In my company we have built quite a few projects around it, and I didn't even ask people to use it, they started using it on their own accord. So I guess it's a good validation that there was a requirement for such a library :)

Geoff Langdale • Dec 12 '17

Multiple string match is a fun problem. I wrote a number of the subsystems that went into Hyperscan as "literal matchers": an early system that used hashing (essentially multiple hash function Rabin-Karp), a large-scale 'bucketed shift-or' algorithm called FDR, and a small-scale - up to about 96 literals - SIMD algorithm called Teddy).

I've also played around with a few other ideas; a favorite recent one is using PEXT (BMI2 instruction on Haswell and beyond) to look up a series of tables. This nice part of this is that you can pick which bits you want to key on (unlike hashing). So if you have a bunch of short strings (say 1 character) and someone has also given you a gazillion strings ending in the same suffix but differing a fair bit at, say, the 40th through 46th bit from the end, you can use PEXT to pull out those particular bits (say 0..7 and 40..46) and look up a table that tells you instantly whether you have one of your short strings and also helps reduce the number of strings you have to check if you have one of your difficult suffixes.

The Holy Grail here is a nice reliable string matching library that has good performance in the bad case but can hit 1 cycle per byte on non-matching traffic.

Aho-Corasick is a good solid algorithm, but the main problem with in IMO is that it's one damn thing after another; specifically a series of reads that depend on the previous state and the data. So it's never going to go all that fast (but it is extremely reliable and not too hard to understand).