DEV Community

Discussion on: Regex was taking 5 days to run. So I built a tool that did it in 15 minutes.

Collapse
 
benaryorg profile image
#benaryorg • Edited

Regex was taking 5 days to run.

What you are doing does not even seem like a job for a Regex.
Regex is a pattern matching engine (like globbing).
A good example for what to use Regexes is filtering non-comment lines in config files (/^\s*(#|$)/), as this requires an actual pattern.
Another good thing about Regexes is their grouping

There's a pretty good reason why, at work, I am always using grep -rF foobar /etc over grep -re foobar /etc when looking for something.
Instead of

It's not that regex is slow either, it's just that you are creating a terrible overhead by using it even though you don't even use the overhead for anything.

What you probably want is lazily splitting the input by whitespace and using a replacement-map (as seen here).
There's a few important things with that:

If search&replacing in a large body of text: don't load it to memory.
You will end up replacing variable-length text, so you're either going to need some sort of Rope, or you're running into performance problems, due to lots of copies and/or reallocations.
Best thing to do is lazily reading the file and outputting it somewhere else (preferably reading stdin and writing to stdout, that way the program is very easy to use).

Second, you might want to take a look at PHF tables.
Rust has a great library for that which generates the PHF table at compile time, so that at runtime it only hashes the value with a very fast and efficient algorithm and does a single comparison to get the desired value of the map.

Update

I did implement this using the above methods (in Rust though).
The PHF is raising the compile time, currently ~12 seconds for ~50k replacement entries, which is quite okay I guess.

Runtime scales linearly with the wordlist-length:

This is doing single threaded search&replace in linear time (~1 second per ⅓ million @ 2.60GHz). Those numbers already include all the I/O to be done for the replacement.