Cool Regex performance hack for "not preceded with"

#regex #javascript

The first part of this post applies only if you run ASP.NET, you can simply skip to the regex hack if not interested, but I strongly recommend you read the whole story.

Debugging ASP.NET CPU spikes

About three weeks ago we started seeing unusual CPU load spikes on our app's production server. We do have a super useful tool (originally written by Sam Saffron, forked by us) showing the exact call stacks that use up all the CPU ticks

But the problem is you have to "attach" this tool to the process at the exact moment the CPU load spikes up. And "catching" that moment can be... uhm... hard.

Even though we've set up AWS to text us when CPU load gets higher than 75% and stays there for 5 consecutive minutes (a must have for anyone who runs a business-critical SaaS app by the way) the time it takes to run to my laptop, remote-login to the sever, get to the command-line, attach to the overloaded process etc etc... CPU load gets back to normal. Darn.

Luckily my cofounder has written a short Windows .BAT script that does this automatically - if the CPU load gets higher than 75% the script finds the web-application process ID, launches the tool, attaches it to the PID and dumps the call-stack data to a text file (if you're going to use the script, change the "OurApplicationPool" to the actual user account that runs your app pool).

Feel free to use:

@Echo OFF

SET /A "MAXUSAGE=75"

For /F %%P in ('wmic cpu get loadpercentage ^| FINDSTR "[0-9]"') do (
echo %%P
    IF %%P GTR %MAXUSAGE% (
        for /f "tokens=1,2 delims= " %%A in ('tasklist /FI "USERNAME eq OurApplicationPool" /fo table /nh') do cpu-analyzer.exe %%B >> log_%date:~-4,4%%date:~-7,2%%date:~-10,2%.txt
    )
)

Then we have simply set up a job in Windows Task Scheduler that runs every 5 minutes (I know, ugly, but we needed a quick and dirty solution).

Optimizing Regex

After some digging through the logs and call stacks it turned out most of the CPU ticks were spent on Regex.Search. Our app uses A LOT of regular expressions both on the back and front ends. For example, when converting markdown and "BbCode" (used for internal storage) to "normal" HTML (used for UI and emails).

Long story short, after hours of blindly trying different regex patterns, testing and benchmarking here's what I found:

If you want to match a rule that goes "something is not preceded with an 'A'" don't ever use code like this: (^|[^a])something.

The first group basically says "it's either the beginning of the text OR not the letter A" and most people find more readable compared to a "regex lookbehind", including myself. But it turns out regex "lookbehinds" and "lookaheads" are almost 3-4 times faster than patterns involving "beginning/end of string" especially when working with large stings.

In our case replacing (^|[^a]) with an (?<!a) (which literally means "not preceded with") has made our code 3 times faster. I wrote a quick benchmark test that does 10000 search-replace iterations in both .NET and JavaScript regex engines (we use both) and the results were:

(^|[^a])something - 1346ms (avg)
(?<!a)something - 389ms (avg)

Whoa, that's more than 3 times faster! Obviously, the longer the input text you have, the bigger the impact.

Cool. But can we improve it even further? Is there a way to make the "not preceded with an A" rule even faster?

Turns out there is! Unfortunately there are not many regex performance analysis on the net, so (again) I was on my own here... But fortunately I somehow remembered reading a cool HackerNews post couple of years ago that suggests a great way to optimize rules like "not preceded with" or "not surrounded with" or "not ends with" and similar. Basically the trick goes like this:

asomething|(something)

Nice, huh? If you want to match "something not preceded with an A" simply match what you DON'T WANT and then ignore overall matches, looking only at the "Group 1"! The benchmarks:

(?<!a)something - 389ms (avg)
asomething|(something) - 16ms (avg)

Holy sh*t! That's almost a hundred times faster than my original code!