Speed up Regex performance with .NET 5

Marcus Turewicz ・8 min read

.NET 5 Preview 1 was recently released, and one of the improvements is in Regex performance. Although Microsoft have said they will do a deep dive on Regex performance soon (edit: Microsoft's deep dive), I thought running some preliminary benchmarks would be a nice way to test out Preview 1.

Disclaimer: these are microbenchmarks done outside of a lab, on a small dev machine, so take them with a grain of salt.

Installing .NET 5 Preview 1

You can download .NET 5 Preview 1 from the .NET Downloads page. As I'm on a Mac, I downloaded and ran the x64 installer.

Upon installing, it was nice to see the branding has been updated to remove "Core":

.NET 5 installer

Let's do a quick check to see if it installed correctly:

$ dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   5.0.100-preview.1.20155.7
 Commit:    1c44853056

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  10.14
 OS Platform: Darwin
 RID:         osx.10.14-x64
 Base Path:   /usr/local/share/dotnet/sdk/5.0.100-preview.1.20155.7/

Host (useful for support):
  Version: 5.0.0-preview.1.20120.5
  Commit:  3c523a6a7a

.NET Core SDKs installed:
  3.1.102 [/usr/local/share/dotnet/sdk]
  5.0.100-preview.1.20155.7 [/usr/local/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.App 3.1.2 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 5.0.0-preview.1.20124.5 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.2 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.0-preview.1.20120.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download

Looks like the CLI still thinks it's "Core". I guess there's still a little way to go to update all the branding. But it installed correctly, which is a good start!

Anyway, let's get on with the benchmarking. The idea here is to come up with some decent Regex patterns, run them against .NET Core 3.1 and .NET 5.0, and compare the performance (time, allocations, etc.). So let's find some complex Regex patterns!

Choosing Regex patterns to test

Searching on Google and GitHub, I quickly came across the repo mariomka/regex-benchmark which tests three Regex patterns against multiple languages.

The benchmarks they test are finding the following information in some text:

  • Email: [\w\.+-]+@[\w\.-]+\.[\w\.-]+
  • URI: [\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?
  • IP: (?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])

Yes, these may not be the most efficient patterns for their respective tasks, but this is about having something to benchmark, so that's OK.
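To get a feel for what each pattern picks up before benchmarking it, here is a minimal sketch that runs the three patterns over one line of text. The sample sentence and the PatternSmokeTest class are made up for illustration; they are not part of the benchmark input or the linked repo.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternSmokeTest
{
    // The three patterns from the list above, copied verbatim.
    public const string Email = @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+";
    public const string Uri = @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?";
    public const string Ip = @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])";

    // Hypothetical sample text (an assumption, not the real input file).
    public const string Sample =
        "Mail jane.doe@example.com, see https://example.com/docs?page=2 or ping 192.168.10.25.";

    // Counts every match of the pattern in the text.
    public static int Count(string pattern, string text) => Regex.Matches(text, pattern).Count;

    public static void Main()
    {
        Console.WriteLine($"Email matches: {Count(Email, Sample)}");
        Console.WriteLine($"URI matches:   {Count(Uri, Sample)}");
        Console.WriteLine($"IP matches:    {Count(Ip, Sample)}");
    }
}
```

Each pattern should find exactly one match in this sample. Note that the IP pattern needs at least two digits per octet, so an address like 10.0.0.1 would not match.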

The input data to apply the Regex patterns to can be found here.

Great. Now that we have our Regex patterns to test, let's create the benchmark runner.

Creating the benchmark runner

We could use System.Diagnostics.Stopwatch in a console app to run the benchmarks (intuitively what most of us might do, and what the repo above does), but the .NET Team uses BenchmarkDotNet for their own benchmarks. In the project's own words:

BenchmarkDotNet helps you to transform methods into benchmarks, track their performance, and share reproducible measurement experiments. It's no harder than writing unit tests! Under the hood, it performs a lot of magic that guarantees reliable and precise results. BenchmarkDotNet protects you from popular benchmarking mistakes and warns you if something is wrong with your benchmark design or obtained measurements. The results are presented in a user-friendly form that highlights all the important facts about your experiment. The library is adopted by 3500+ projects including .NET Core and supported by the .NET Foundation.

So using BenchmarkDotNet, we can get more consistent and repeatable results, with nice reports as well - sounds good to me!
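For comparison, the hand-rolled Stopwatch approach might look roughly like the sketch below. The StopwatchTiming class, the pattern and the input string are placeholders I made up. A single measurement like this also includes one-off costs such as JIT compilation and building the regex, and has no warmup or repeated iterations, which is the kind of thing BenchmarkDotNet takes care of.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class StopwatchTiming
{
    // Times one run of the pattern over the text.
    // Returns the number of matches and the elapsed wall-clock time.
    public static (int Matches, TimeSpan Elapsed) TimeOnce(string pattern, string text)
    {
        var stopwatch = Stopwatch.StartNew();
        int matches = Regex.Matches(text, pattern).Count; // .Count forces the matching to run
        stopwatch.Stop();
        return (matches, stopwatch.Elapsed);
    }

    public static void Main()
    {
        // Placeholder pattern and input, only here to make the sketch runnable.
        var (matches, elapsed) = TimeOnce(@"\d+", "a1 b22 c333");
        Console.WriteLine($"{matches} matches in {elapsed.TotalMilliseconds} ms");
    }
}
```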

Create a console app

Before creating the benchmarks, we need something for them to run in. Typically this is a console app, so let's create one:

$ dotnet new console -n DotNet5RegexBenchmark

This will create a console app named "DotNet5RegexBenchmark" with a default Main method:

using System;

namespace DotNet5RegexBenchmark
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Now we can start to add the benchmark code.

Add BenchmarkDotNet package

Let's add BenchmarkDotNet to the project. It is supplied as either a NuGet package or a .NET Tool. I chose to use the NuGet package in this example:

$ cd DotNet5RegexBenchmark
$ dotnet add package BenchmarkDotNet

We also need to ensure the input file is copied to the output directory so we can find it at runtime.

The project file DotNet5RegexBenchmark.csproj should now look like this. Note that the TargetFramework is netcoreapp3.1, BenchmarkDotNet has been added as a package, and the input file is copied to the output directory:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>netcoreapp3.1</TargetFramework>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="BenchmarkDotNet" Version="0.12.0" />
  </ItemGroup>

  <ItemGroup>
    <None Update="input-text.txt">
      <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
    </None>
  </ItemGroup>

</Project>

Ok, now that our console app is created, let's add the benchmarks.

Add the benchmarks

The canonical benchmark example looks like this:

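using System;
using System.Security.Cryptography;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

[SimpleJob(RuntimeMoniker.Net472, baseline: true)]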
[SimpleJob(RuntimeMoniker.NetCoreApp30)]
[SimpleJob(RuntimeMoniker.CoreRt30)]
[SimpleJob(RuntimeMoniker.Mono)]
[RPlotExporter]
public class Md5VsSha256
{
    private SHA256 sha256 = SHA256.Create();
    private MD5 md5 = MD5.Create();
    private byte[] data;

    [Params(1000, 10000)]
    public int N;

    [GlobalSetup]
    public void Setup()
    {
        data = new byte[N];
        new Random(42).NextBytes(data);
    }

    [Benchmark]
    public byte[] Sha256() => sha256.ComputeHash(data);

    [Benchmark]
    public byte[] Md5() => md5.ComputeHash(data);
}

Without reading the full BenchmarkDotNet docs (which I would recommend as bedtime reading), we can see from the example that there are a few key parts to setting up a benchmark.

Class definition

Firstly, we should encapsulate the benchmarks in a class. Let's create ours:

namespace DotNet5RegexBenchmark
{
    public class RegexBenchmarks
    {

    }
}

Right, not much going on yet so let's add the setup.

Setup

The Setup is where you want to do anything that's required for each benchmark to run like new-ing up any data. For these benchmarks, we need to pull in the text file that we're applying the Regex patterns to, so this is a great place for doing that:

using System.IO;
using BenchmarkDotNet.Attributes;

namespace DotNet5RegexBenchmark
{
    public class RegexBenchmarks
    {
        private string _data;

        [GlobalSetup]
        public void Setup()
        {
            _data = File.ReadAllText("input-text.txt");
        }
    }
}

Also note that you need to add the [GlobalSetup] attribute to your setup method to tell BenchmarkDotNet that this is your setup method, and it will be called before running the benchmarks.

Now that the setup is done, let's add the benchmarks.

Benchmarks

The best way to manage benchmarks is by creating a method for each one and applying the [Benchmark] attribute to tell BenchmarkDotNet that this method is a Benchmark. So let's add the three benchmarks:

using System.IO;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;

namespace DotNet5RegexBenchmark
{
    public class RegexBenchmarks
    {
        private string _data;

        [GlobalSetup]
        public void Setup()
        {
            _data = File.ReadAllText("input-text.txt");
        }

        [Benchmark]
        public int Email() => Regex.Matches(_data, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", RegexOptions.Compiled).Count;

        [Benchmark]
        public int URI() => Regex.Matches(_data, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", RegexOptions.Compiled).Count;

        [Benchmark]
        public int IP() => Regex.Matches(_data, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])", RegexOptions.Compiled).Count;
    }
}

Edit: As pointed out by Stephen Toub, Regex matching is lazy, so you need to actually access the matches to run the matching code. My initial benchmarks were not doing this, which is why I was not seeing the 3-6x speedup. I've now included a .Count call to actually run all of the matches.
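The difference between the two forms can be sketched as follows. The LazyMatches class and the sample string are made up for illustration; the only point is that the first method hands back the collection untouched, while the second asks for Count and so has to scan the whole input.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LazyMatches
{
    // The Email pattern from the benchmarks above.
    public const string Email = @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+";

    // Returns the collection without reading from it.
    // Because matching is lazy, a benchmark built on this measures almost nothing.
    public static MatchCollection Deferred(string text) =>
        Regex.Matches(text, Email, RegexOptions.Compiled);

    // Asking for Count forces every match to be computed.
    public static int Forced(string text) =>
        Regex.Matches(text, Email, RegexOptions.Compiled).Count;

    public static void Main()
    {
        // Hypothetical input with two addresses.
        const string text = "x@y.com z@w.org";

        MatchCollection pending = Deferred(text); // no matching work done yet
        Console.WriteLine($"Matches found: {Forced(text)}");
        Console.WriteLine($"First match:   {pending[0].Value}"); // indexing triggers matching
    }
}
```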

Ok, looks good, almost there. Now let's add the runtimes to target.

Runtimes

The plan is to compare .NET Core 3.1 and .NET 5.0, so let's do that. This can easily be done with some more attributes on the class. Let's also add the [MemoryDiagnoser] attribute to show memory usage, as it's not turned on by default:

using System.IO;
using System.Text.RegularExpressions;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Jobs;

namespace DotNet5RegexBenchmark
{
    [SimpleJob(RuntimeMoniker.NetCoreApp31, baseline: true)]
    [SimpleJob(RuntimeMoniker.NetCoreApp50)]
    [MemoryDiagnoser]
    public class RegexBenchmarks
    {
        private string _data;

        [GlobalSetup]
        public void Setup()
        {
            _data = File.ReadAllText("input-text.txt");
        }

        [Benchmark]
        public int Email() => Regex.Matches(_data, @"[\w\.+-]+@[\w\.-]+\.[\w\.-]+", RegexOptions.Compiled).Count;

        [Benchmark]
        public int URI() => Regex.Matches(_data, @"[\w]+://[^/\s?#]+[^\s?#]+(?:\?[^\s#]*)?(?:#[^\s]*)?", RegexOptions.Compiled).Count;

        [Benchmark]
        public int IP() => Regex.Matches(_data, @"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9])", RegexOptions.Compiled).Count;
    }
}

Ok, I think the class is done. Let's now hook it up to the Program.

Program

We just need to call the benchmark runner on our class from the main program, and we should be good to go:

using BenchmarkDotNet.Running;

namespace DotNet5RegexBenchmark
{
    class Program
    {
        static void Main(string[] args)
        {
            BenchmarkRunner.Run<RegexBenchmarks>();
        }
    }
}

Let's run this sucker!

Benchmark results

I'm running this in Release configuration with the following command (elevated privileges are required):

$ sudo dotnet run -c Release

and here are the results:

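| Method | Runtime       |         Mean |      Error |     StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated  |
|--------|---------------|-------------:|-----------:|-----------:|------:|------:|------:|------:|-----------:|
| Email  | .NET Core 3.1 | 1,357.990 ms | 28.9406 ms | 33.3280 ms |  1.00 |     - |     - |     - |   20.99 KB |
| Email  | .NET Core 5.0 |   263.536 ms |  1.5500 ms |  1.2943 ms |  0.19 |     - |     - |     - |   23.28 KB |
| URI    | .NET Core 3.1 | 1,138.896 ms |  5.4737 ms |  4.5708 ms |  1.00 |     - |     - |     - | 1205.21 KB |
| URI    | .NET Core 5.0 |   228.324 ms |  1.3113 ms |  1.2266 ms |  0.20 |     - |     - |     - |  1205.1 KB |
| IP     | .NET Core 3.1 |    94.695 ms |  1.2128 ms |  1.1344 ms |  1.00 |     - |     - |     - |    1.36 KB |
| IP     | .NET Core 5.0 |     9.411 ms |  0.0423 ms |  0.0395 ms |  0.10 |     - |     - |     - |    1.24 KB |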

So the .NET Team weren't lying: .NET 5 Regex is definitely faster, roughly 5x to 10x for these benchmarks, while allocations stay about the same! It will be nice to see the .NET Team's own benchmarks when they come out, as they'll likely be a lot more detailed and scientific than mine, but the future is looking fast!
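The speedup figures come straight from the Mean column: divide the .NET Core 3.1 mean by the .NET 5.0 mean (the inverse of the Ratio column). A minimal sketch of that arithmetic, with a made-up Speedup helper and the means copied from the results above:

```csharp
using System;

public static class Speedup
{
    // Speedup factor = baseline mean / candidate mean (both in the same unit).
    public static double Factor(double baselineMs, double candidateMs) => baselineMs / candidateMs;

    public static void Main()
    {
        // Mean times in milliseconds, taken from the results above.
        Console.WriteLine($"Email: {Factor(1357.990, 263.536):F1}x");
        Console.WriteLine($"URI:   {Factor(1138.896, 228.324):F1}x");
        Console.WriteLine($"IP:    {Factor(94.695, 9.411):F1}x");
    }
}
```

That works out to about 5.2x for Email, 5.0x for URI and 10.1x for IP.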

Summary

The first preview of .NET 5 was recently released with improved Regex performance. I tested this with BenchmarkDotNet on three patterns and it is definitely the case: roughly 5x to 10x faster. The .NET Team will likely release their own statistics on Regex performance as the release firms up. This is only the first preview of .NET 5, so expect more performance improvements in the next previews.


Discussion

Thanks for the nice write-up, Marcus. One issue I noticed in your final benchmark on Regex. Regex.Matches is lazy, meaning it doesn't actually execute the regex until it needs to. When you run Regex.Matches, it's just fetching the relevant regex from the cache (or parsing/compiling it if it can't find it), and returning to you the collection object that will enable you to retrieve all the matches, but it hasn't done any matching yet. It's only when you iterate the collection, ask for its Count, index into it, etc., that it will compute as little as it needs to in order to answer your question (e.g. if you ask for matches[2], it'll ensure it has the 0th, 1st, and 2nd matches, but it needn't go beyond that yet). So in your benchmark, it's not actually running the regex at all. That's also why, even though the three regexes being tested have various levels of complexity, the benchmarks are all coming back as approximately the same answer. I'd suggest redoing the benchmark with .Count appended to each Matches call, or something like that.

 

Ahh thanks Stephen! I did wonder why I was not seeing the 3-6x speed up that you were... I just thought the benchmark I had chosen was not a good candidate, but clearly I've got some reading to do on Regex! I've now updated to include Count and the results look much better.