When developers work with text, they often need to clean it up first. Sometimes it’s by replacing keywords. Like replacing “Javascript” with “JavaS...
It seems like you discovered that regex is not the right tool for keyword detection or replacement :)
I like how you explain the role of tries in your eventual solution. I learned from your 'mistake', so thanks for writing.
Thanks :) Completely agree that one size fits all doesn't work. It's just that when we start, we start with a simpler problem, and sometimes to scale we need to re-think at an algorithmic level.
PS: Thanks for your feedback :)
I've read your article on Medium, and my reaction was that regular expressions simply weren't the right choice.
I really like regular expressions, but they have to be used for certain kinds of tasks, not as a sledgehammer to open a door when a lockpick is also available.
Good for you that you've found the lockpick. Even with a language as slow as Python, the difference can be several orders of magnitude. The point is that regular expressions are a pseudo-language of their own, with their own rules and principles.
Remember, you wouldn't use regexes to parse HTML.
But that's because HTML isn't regular... I don't want to be "that guy", but regular expressions are definitely the lockpick, and this trie stuff is the sledgehammer. Whenever you can use a RE, you should, because it's the fastest and simplest thing you can do. It's just that many RE libraries out there are pretty slow and unintuitive to use...
I'll leave this here:
Now, I think this is an exaggeration, and regular expressions do have a place in development (and I also like them a lot in general), but there are some very critical aspects to consider:
There are a lot of cases where you can use regular expressions, but a lot fewer where you should. I'll also leave this:
(Now that's another hyperbole, but the point is that we have to maintain the code we write, so it's better to make our life easier.)
And that's indeed another aspect to consider: many regex engines are slow, to boot, but we have to deal with that because we simply have no alternatives. And they have awkward APIs, too.
So, should we use them anyway, you say?
Sure, that quote is fun, but it doesn't carry a very strong point.
The language and what you can do with it is standardized: unions, differences, Kleene stars and so on. The rest is just syntax.
A well designed regex engine is a finite automaton. That's it.
I don't think this is too hard to read.
Maybe.
Of course you can step through a RE. It's just a finite automaton; just step through it.
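As a toy illustration of that stepping (the DFA below is hand-built for the expression ab*c; it is not taken from any real engine):

```python
# A hand-built DFA for the regular expression a b* c,
# stepped one character at a time. States are just ints.
TRANSITIONS = {
    (0, 'a'): 1,
    (1, 'b'): 1,
    (1, 'c'): 2,
}
ACCEPTING = {2}

def matches(s):
    state = 0
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False  # dead state: no transition for this character
    return state in ACCEPTING

print(matches("abbbc"))  # True
print(matches("ac"))     # True
print(matches("abcb"))   # False
```

Each character triggers exactly one table lookup, which is why a DFA-based engine runs in time linear in the input length.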
But we do have alternatives. Ragel, as mentioned above, is a really good one. RE2 for Python is supposedly good. The Rust regex crate is good. Alex is also pretty good.
Not so easy. The point is that you can't transpose a regular expression from a different language without a second thought. It means that regular expressions are another layer of programming language that you have to take into consideration.
A computer doesn't care what a regex is. You shouldn't either, as it doesn't make any difference. The implementation, though, can be very different, and that is something that should be cared about.
That's not a regular expression: it's a list of definitions used by a regex engine.
Only if you have a library that replicates a regex engine. In many languages, you just use the regular expression engine that's natively implemented, because any non-native solution is usually several orders of magnitude slower; it's basically a reinvented wheel, and it just adds another dependency to the project.
If it's slower and takes additional configuration, the "fastest and simplest you can do" part simply disappears.
In some languages maybe they're worth considering. But nobody uses a custom regex engine in JavaScript, or in PHP, or in Python, except for very limited cases.
I never, ever get tired of that post. Such an important lesson for all developers!
Super, super cool! Did you try comparing it to a regex with branching by any chance as well? I.e., for finding "python" or "java" or "javascript", did you try
~(java(script)?|python)~
or
~(javascript|java|python)~
? I'm just curious; I suspect yours would be faster regardless!

Hey Shemayev,
What I did was something like this:
\b(javascript|java|python)\b
. I didn't benchmark it against the one you are suggesting:
~(java(script)?|python)~
I will look at it sometime :) Thanks for the suggestion. Building it for 10K+ keywords will be really hard though. I will try :)

These days I often work on NLU, so that sounds pretty good.
It would be interesting to apply a normalizing function on the input text, just for matching. Something like:
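The code sample seems to have been lost from this comment; a plausible sketch of such a normalizing function, assuming Unicode decomposition plus casefolding is what was meant:

```python
import unicodedata

def normalize(text):
    # Decompose accented characters (é -> e + combining accent),
    # drop the combining marks, then casefold for matching.
    decomposed = unicodedata.normalize('NFKD', text)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

print(normalize("My name is RÉMY"))  # my name is remy
```

The original text is left untouched; the normalized copy is used only to find the match positions.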
This would help normalize the writing of keywords without mangling the whole sentence.
@remy: Sorry, I didn't get that completely. Can you please elaborate on the expected output and how the normalize function makes it happen?
Suppose that your input is one of
My name is remy
My name is RÉMY
My name is Rémy
Then your output would be
My name is Rémy
It's like when you say you want to replace different instances of
JavaScript
. If you want
JavaScript
formatted the same way all the time, then you can use this technique to achieve that.

That can already be done, right?
Yup, but then you're getting
my name is Rémy
instead of
My name is Rémy
.

Also, it would allow processing the string without holding it in memory several times (and thus possibly working on a stream). If you're dealing with big texts it might be interesting as well.
I don't have a direct application right now though, but from the things I usually do I'm guessing it would make sense.
OK Rémy. Btw, if we change the normalize method to not lowercase the text, your requirement will be solved.
Also, if I call normalize from within FlashText or outside FlashText it will be the same amount of memory and computation.
Still, I will keep looking for a possible use case for your suggestion. Thanks for bringing it up :) :)
I just want to clarify that regular expressions, by themselves, are not the problem. Indeed, a regular expression is theoretically the fastest way to parse a string, and with an automaton-based engine the work always grows linearly with the input string length (never with "regex complexity"). The problem is Python's re library and its replacement implementation.
Cool nonetheless.
I would be interested in the comparison between flashtext versus RE2. In my experience you almost always want to use RE2 in any language that supports it. It is vastly faster than any other real RE implementation I've used and its limitations (i.e., you can't use the non-regular PCRE features) only stop you from doing things you shouldn't be doing with a regular expression anyway.
I agree. I would be even more interested in a comparison between flashtext and a well designed Ragel parser (or similar; I've heard that the Rust RE is pretty good too).
I will check out RE2 and Ragel. I had also tried regex in Java and some advanced regex libraries in Python.
The conclusion I came to was that FlashText is fast because it doesn't do backtracking. Any regex library that does backtracking would be slower, time-complexity-wise. C/C++ might execute faster, but backtracking is still a slow algorithm.
I will still check out the libraries you guys suggested. Thanks :)
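The backtracking cost is easiest to see with a pathological pattern; this toy example (not from the benchmark) forces Python's backtracking engine through exponentially many paths:

```python
import re

# Nested quantifiers like (a+)+ make a backtracking engine try
# exponentially many ways to split the run of 'a's once the
# overall match fails.
pattern = re.compile(r'^(a+)+b$')

# Each extra 'a' roughly doubles the work; around 25-30 characters
# the match takes seconds, while a non-backtracking
# (automaton-based) engine stays linear in the input length.
subject = 'a' * 20 + 'c'
print(pattern.match(subject))  # None
```

The pattern itself is harmless on matching input; it is the failed match on almost-matching input that explodes.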
Yes -- RE2 doesn't do backtracking precisely so that it can run faster, which is why I thought it would be an interesting comparison. Here is a page that describes the basic idea, and here is a larger pool of resources for efficient regex implementation (considering "real" regular expressions only, not PCRE-compatible regexes that may contain non-regular constructs that would need to be implemented with backtracking).
Hmm. I applaud you for creating a useful library, but about having a solution that runs in 5 days: if you had a multi-core machine, you might have used N instances of your regex program, each running on 1/N-th of your inputs, to get it down to, say, a day?
Given hundreds of replacements, I would have at least got estimated run times for that version where you look up each word in a dictionary of replacements.
(Simplistically):
out = ' '.join([lookup.get(word, word) for word in text.strip().split()])
But the community gains a new tool! I applaud you sir :-)
@paddy3118 FlashText is designed to deal with multi-term keywords, like
'java script' getting replaced with 'javascript'
. There is also the problem of extracting java from
I like java.
(notice the full stop at the end). There are multiple other problems with assuming that this problem is as simple as you assumed it to be.

PS: You assuming that I wouldn't have tried your suggestion is fine, but you assuming that everyone who clapped is not smart enough to figure out your suggestion by themselves is not.
Not sure of what you are accusing me of there?
On the comment about what makes a word, and multi-word lookups: solving those issues could be thought of as a task you would have to do for your new library anyway; but the new library then goes on to use a trie, whereas dicts are a built-in datatype.
I am sorry, I can't help you. Let's move on in life and do better things :)
All the best :)
What you are doing does not even seem like a job for a Regex.
Regex is a pattern matching engine (like globbing).
A good example of what to use regexes for is filtering non-comment lines in config files (
/^\s*(#|$)/
), as this requires an actual pattern.

Another good thing about regexes is their grouping.
There's a pretty good reason why, at work, I always use
grep -rF foobar /etc
over
grep -re foobar /etc
when looking for something.

It's not that regex is slow either; it's just that you are creating a terrible overhead by using it, even though you don't use that overhead for anything.

Instead of that, what you probably want is lazily splitting the input by whitespace and using a replacement map (as seen here).
There are a few important things with that:
If search&replacing in a large body of text: don't load it to memory.
You will end up replacing variable-length text, so you're either going to need some sort of Rope, or you're running into performance problems, due to lots of copies and/or reallocations.
Best thing to do is lazily reading the file and outputting it somewhere else (preferably reading stdin and writing to stdout, that way the program is very easy to use).
Second, you might want to take a look at PHF tables.
Rust has a great library for that which generates the PHF table at compile time, so that at runtime it only hashes the value with a very fast and efficient algorithm and does a single comparison to get the desired value of the map.
Update
I did implement this using the above methods (in Rust though).
The PHF raises the compile time, currently ~12 seconds for ~50k replacement entries, which is quite okay I guess.
Runtime scales linearly with the wordlist length.
Out of curiosity, how long would a simple replace() take on your document set? Regular expressions are a good tool when you need to do complex matches, but are pretty inefficient when you're doing a simple text replacement.
str.replace()
is likely to be far more efficient than
re.sub()
when you're just doing simple string matching and not really using any of the power of regular expressions.

To be clear, it's still likely to be a good bit slower than FlashText, but I'm just curious what the difference is.
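One way to satisfy that curiosity on a synthetic document (the document and repetition counts here are made up; actual timings vary by machine, so none are quoted as facts):

```python
import re
import timeit

# A synthetic ~390 KB document for a fixed-string substitution.
article = "I like Python and regular expressions. " * 10_000

t_replace = timeit.timeit(lambda: article.replace("Python", "python"), number=50)
t_resub = timeit.timeit(lambda: re.sub("Python", "python", article), number=50)

# Both produce the same result; str.replace just skips the regex
# machinery entirely for a fixed-string substitution.
print(f"str.replace: {t_replace:.3f}s, re.sub: {t_resub:.3f}s")
```

Neither approach handles word boundaries or 10K+ simultaneous keywords, which is the gap FlashText fills.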
Hey Vinay,
I had 10K+ terms. It simply didn't make sense to do 20K replace calls. Plus I needed word boundaries to be honoured, so the only choice for me was some re library. Hope that answers your question.
PS: each
str.replace()
call will go over the entire document/string, so that's 20K passes * number of docs (too much complexity), and it won't take word boundaries into consideration.

Thanks for sharing. Please try wReplace for 'massive' 500+ replacements:
(just Google for 'wReplace', I can't put links here).
Tell me if it fits your needs. If possible, please share/point me to your benchmark data.
Sure, Benchmark code is linked in the article
From the article
Code used for the benchmark shown above is linked here, and here.
Is wReplace a UI-based tool? Is there an API interface for the library?
Building your own regex or string matching library is a rewarding experience.
That being said, this is familiar terrain. We built Hyperscan ( github.com/intel/hyperscan ) several years ago. I would hope it runs your workload faster than 5 days, and as a bonus extra, you can use multiple regular expressions, not just fixed literal strings.
Cool, I will check it out, Geoff.
There are very good regex libraries out there, no question. But what is the best library if I don't want to match a regular expression, just words with word boundaries?
I didn't find anything solving the problem I was facing (I googled and tried libraries for weeks before giving up and writing my own). I am very lazy that way, avoiding work till it's an absolute necessity :D
My use case does not involve multiple regular expressions; I do NLP and I deal with a lot of strings. It seemed like a good use case, hence I wrote the library :)
PS: In my company we have built quite a few projects around it, and I didn't even ask people to use it; they started using it of their own accord. So I guess it's good validation that there was a requirement for such a library :)
Multiple string match is a fun problem. I wrote a number of the subsystems that went into Hyperscan as "literal matchers": an early system that used hashing (essentially multiple-hash-function Rabin-Karp), a large-scale 'bucketed shift-or' algorithm called FDR, and a small-scale (up to about 96 literals) SIMD algorithm called Teddy.
I've also played around with a few other ideas; a favorite recent one is using PEXT (BMI2 instruction on Haswell and beyond) to look up a series of tables. The nice part of this is that you can pick which bits you want to key on (unlike hashing). So if you have a bunch of short strings (say 1 character) and someone has also given you a gazillion strings ending in the same suffix but differing a fair bit at, say, the 40th through 46th bit from the end, you can use PEXT to pull out those particular bits (say 0..7 and 40..46) and look up a table that tells you instantly whether you have one of your short strings and also helps reduce the number of strings you have to check if you have one of your difficult suffixes.
The Holy Grail here is a nice reliable string matching library that has good performance in the bad case but can hit 1 cycle per byte on non-matching traffic.
Aho-Corasick is a good solid algorithm, but the main problem with it IMO is that it's one damn thing after another; specifically, a series of reads that depend on the previous state and the data. So it's never going to go all that fast (but it is extremely reliable and not too hard to understand).
Most people will reach for regexes (re.sub(r'foo', 'bar', article)) when what they want is replace (article.replace('foo', 'bar')) plus some tokenization. I'd bet that using replace + a loop on a vector of keywords/replacements would be faster than tries (though less space efficient).
I love Tries. They are my go-to data structure when performance is needed. I've used them to solve dice letter games, create a search suggestion service, and more.
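For readers who haven't met them, a minimal trie with prefix suggestions, the kind of structure a search suggestion service is built on (a toy sketch, not FlashText's implementation):

```python
class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        # Each character becomes a nested dict level.
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker

    def suggestions(self, prefix):
        # Walk down to the prefix node, then collect every
        # completion stored beneath it.
        node = self.root
        for ch in prefix:
            if ch not in node:
                return []
            node = node[ch]
        out = []
        def collect(n, path):
            if '$' in n:
                out.append(prefix + path)
            for ch, child in n.items():
                if ch != '$':
                    collect(child, path + ch)
        collect(node, '')
        return out

t = Trie()
for w in ["java", "javascript", "python"]:
    t.insert(w)
print(t.suggestions("jav"))  # ['java', 'javascript']
```

Lookup cost depends only on the length of the query, not on how many words are stored, which is why tries scale so well to 10K+ keywords.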
Cool :) I am also loving Tries so much more now :)
It is trending on reddit! Awesome discussions there....
redd.it/7j3433
Hey Lilly,
Thanks for sharing. Facing a simple problem with the trie re. Would love to switch though, if I get some help with this.
Ah, as I thought; tries in Python are not as fast (or as simple) as using a dict-based solution: paddy3118.blogspot.co.uk/2017/12/n...
Tries tend to do that, be fast :D
Couldn't solutions like Elasticsearch be used to also modify texts? Or are they only for searching?
I use Elasticsearch heavily at my workplace, and it did cross our mind (at least for the extract-keywords part). But Elasticsearch is complicated to set up, load, and play around with (because of special characters it becomes much more complicated). FlashText is much easier and more portable. It has become much more integrated into our projects now :)
All that is possible because it's fast and simple.
Tries tend to do that, be fast :D
Good one :D

Have you tried benchmarking against a simple
"word" in corpus
?

fgrep(1) comes to mind.
Thanks for the article. Can you please explain how you used this in your word2vec/doc2vec process? Was it used for pre-processing prior to training word2vec or doc2vec? Can you please elaborate?
Is this available in JavaScript (front end)?