
Strings Are Evil

Indy Singh ・ Updated on ・ 12 min read

Reducing memory allocations from 7.5GB to 32KB

Contents

  1. Context of the problem
  2. Establishing a baseline
  3. Easy win 1
  4. Easy win 2
  5. Splits are never cool
  6. Lists are not always nice
  7. Pooling byte arrays
  8. Goodbye StringBuilder
  9. Skipping commas
  10. The war between classes and structs
  11. Goodbye StreamReader
  12. TLDR - give me a table

Context of the problem

Codeweavers is a financial services software company; part of what we do is enable our customers to bulk import their data into our platform. For our services we require up-to-date information from all our clients, which include lenders and manufacturers across the UK. Each of those imports can contain several hundred megabytes of uncompressed data, and they are often run on a daily basis.

This data is then used to power our real-time calculations. Currently this import process has to take place outside of business hours because of the impact it has on memory usage.

In this article we will explore potential optimisations to the import process specifically within the context of reducing memory during the import process. If you want to have a go yourself, you can use this code to generate a sample input file and you can find all of the code talked about here.

Establishing a baseline

The current implementation uses StreamReader and passes each line to the lineParser.

using (StreamReader reader = File.OpenText(@"..\..\example-input.csv"))
{
    try
    {
        while (reader.EndOfStream == false)
        {
            lineParser.ParseLine(reader.ReadLine());
        }
    }
    catch (Exception exception)
    {
        throw new Exception("File could not be parsed", exception);
    }
}

The most naive implementation of a line parser that we originally had looked something like this:-

public sealed class LineParserV01 : ILineParser
{
    public void ParseLine(string line)
    {
        var parts = line.Split(',');

        if (parts[0] == "MNO")
        {
            var valueHolder = new ValueHolder(line);
        }
    }
}

The ValueHolder class is used later on in the import process to insert information into the database:-

public class ValueHolder
{
    public int ElementId { get; }
    public int VehicleId { get; }
    public int Term { get; }
    public int Mileage { get; }
    public decimal Value { get; }

    public ValueHolder(string line)
    {
        var parts = line.Split(',');

        ElementId = int.Parse(parts[1]);
        VehicleId = int.Parse(parts[2]);
        Term = int.Parse(parts[3]);
        Mileage = int.Parse(parts[4]);
        Value = decimal.Parse(parts[5]);
    }
}

Running this example as a command line application and enabling monitoring:-

public static void Main(string[] args)
{
    AppDomain.MonitoringIsEnabled = true;

    // do the parsing

    Console.WriteLine($"Took: {AppDomain.CurrentDomain.MonitoringTotalProcessorTime.TotalMilliseconds:#,###} ms");
    Console.WriteLine($"Allocated: {AppDomain.CurrentDomain.MonitoringTotalAllocatedMemorySize / 1024:#,#} kb");
    Console.WriteLine($"Peak Working Set: {Process.GetCurrentProcess().PeakWorkingSet64 / 1024:#,#} kb");

    for (int index = 0; index <= GC.MaxGeneration; index++)
    {
        Console.WriteLine($"Gen {index} collections: {GC.CollectionCount(index)}");
    }
}

Our main goal today is to reduce allocated memory. In short, the less memory we allocate, the less work the garbage collector has to do. There are three generations that the garbage collector operates against; we will be monitoring those as well. Garbage collection is a complex topic and outside the scope of this article, but a good rule of thumb is that short-lived objects should never be promoted past generation 0.
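That rule of thumb can be observed directly with GC.GetGeneration: an object that survives a generation 0 collection is promoted to generation 1. A standalone illustration (not part of the import code):

```csharp
using System;

var survivor = new byte[256];

// freshly allocated objects start life in generation 0
Console.WriteLine(GC.GetGeneration(survivor));

// force a generation 0 collection; `survivor` is still referenced,
// so it survives the collection and is promoted to generation 1
GC.Collect(0);
Console.WriteLine(GC.GetGeneration(survivor));
```

If the import allocates so heavily that live objects keep surviving collections, they get promoted and become more expensive to collect later, which is why the Gen 1 and Gen 2 counts are worth watching too.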

We can see V01 has the following statistics:-

Took: 8,750 ms
Allocated: 7,412,303 kb
Peak Working Set: 16,720 kb
Gen 0 collections: 1809
Gen 1 collections: 0
Gen 2 collections: 0

Almost 7.5 GB of memory allocations to parse a three hundred megabyte file is less than ideal. Now that we have established the baseline, let us find some easy wins…

Easy win 1

Eagle-eyed readers will have spotted that we call string.Split(',') twice; once in the line parser and again in the constructor of ValueHolder. This is wasteful; we can overload the constructor of ValueHolder to accept a string[] and split the line once in the parser. After that simple change the statistics for V02 are now:-

Took: 6,922 ms
Allocated: 4,288,289 kb
Peak Working Set: 16,716 kb
Gen 0 collections: 1046
Gen 1 collections: 0
Gen 2 collections: 0

Great! We are down from 7.5GB to 4.2GB. But that is still a lot of memory allocations for processing a three hundred megabyte file.
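For reference, the V02 change described above might look something like this (a sketch; the exact code is in the linked repository):

```csharp
public sealed class LineParserV02 : ILineParser
{
    public void ParseLine(string line)
    {
        // split once here and pass the parts along,
        // instead of splitting again inside ValueHolder
        var parts = line.Split(',');

        if (parts[0] == "MNO")
        {
            var valueHolder = new ValueHolder(parts);
        }
    }
}
```

ValueHolder gains a constructor taking string[] that reads parts[1] through parts[5] exactly as before.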

Easy win 2

Quick analysis of the input file reveals that there are 10,047,435 lines; we are only interested in lines that are prefixed with MNO, of which there are 10,036,466. That means we are unnecessarily processing an additional 10,969 lines. A quick change in V03 to only parse lines prefixed with MNO:-

public sealed class LineParserV03 : ILineParser
{
    public void ParseLine(string line)
    {
        if (line.StartsWith("MNO"))
        {
            var valueHolder = new ValueHolder(line);
        }
    }
}

This means we defer splitting the entire line until we know it is a line we are interested in. Unfortunately this did not save us much memory, mainly because we are interested in 99.89% of the lines in the file. The statistics for V03:-

Took: 8,375 ms
Allocated: 4,284,873 kb
Peak Working Set: 16,744 kb
Gen 0 collections: 1046
Gen 1 collections: 0
Gen 2 collections: 0

It is time to break out the trusty profiler, in this case dotTrace:-

Strings in the .NET ecosystem are immutable, meaning that anything we do to a string always returns a brand new copy. Therefore calling string.Split(',') on every line (remember, there are 10,036,466 lines we are interested in) returns each line split into several smaller strings. Each line has at minimum five sections we want to process. That means over the lifetime of the import process we create at least 50,182,330 strings! Next we will explore what we can do to eliminate the use of string.Split(',').

Splits are never cool

A typical line we are interested in looks something like this:-

MNO,3,813496,36,30000,78.19,,

Calling string.Split(',') on the above line will return a string[] containing:-

'MNO'
'3'
'813496'
'36'
'30000'
'78.19'
''
''

Now at this point we can make some guarantees about the file we are importing:-

  • The length of each line is not fixed
  • The number of sections that are delimited by a comma is fixed
  • We only use the first three characters of each line to determine our interest in the line
  • This means there are five sections we are interested in but the section length is unknown
  • Sections do not change locations (e.g MNO is always the first section)

Guarantees established, we can now build a short-lived index of the positions of all the commas in a given line:-

private List<int> FindCommasInLine(string line)
{
    var list = new List<int>();

    for (var index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            list.Add(index);
        }
    }

    return list;
}

Once we know the position of each comma, we can directly access the section we care about and manually parse that section.

private decimal ParseSectionAsDecimal(int start, int end, string line)
{
    var sb = new StringBuilder();

    for (var index = start; index < end; index++)
    {
        sb.Append(line[index]);
    }

    return decimal.Parse(sb.ToString());
}

private int ParseSectionAsInt(int start, int end, string line)
{
    var sb = new StringBuilder();

    for (var index = start; index < end; index++)
    {
        sb.Append(line[index]);
    }

    return int.Parse(sb.ToString());
}

Putting it all together:-

public void ParseLine(string line)
{
    if (line.StartsWith("MNO"))
    {
        var findCommasInLine = FindCommasInLine(line);

        var elementId = ParseSectionAsInt(findCommasInLine[0] + 1, findCommasInLine[1], line); // equal to parts[1] - element id
        var vehicleId = ParseSectionAsInt(findCommasInLine[1] + 1, findCommasInLine[2], line); // equal to parts[2] - vehicle id
        var term = ParseSectionAsInt(findCommasInLine[2] + 1, findCommasInLine[3], line); // equal to parts[3] - term
        var mileage = ParseSectionAsInt(findCommasInLine[3] + 1, findCommasInLine[4], line); // equal to parts[4] - mileage
        var value = ParseSectionAsDecimal(findCommasInLine[4] + 1, findCommasInLine[5], line); // equal to parts[5] - value
        var valueHolder = new ValueHolder(elementId, vehicleId, term, mileage, value);
    }
}

Running V04 reveals these statistics:-

Took: 9,813 ms
Allocated: 6,727,664 kb
Peak Working Set: 16,872 kb
Gen 0 collections: 1642
Gen 1 collections: 0
Gen 2 collections: 0

Whoops, that is worse than expected. It is an easy mistake to make but dotTrace can help us here…

Constructing a StringBuilder for every section of every line is incredibly expensive. Luckily it is a quick fix: we construct a single StringBuilder when V05 is constructed and clear it before each use. V05 now has the following statistics:-

Took: 9,125 ms
Allocated: 3,199,195 kb
Peak Working Set: 16,636 kb
Gen 0 collections: 781
Gen 1 collections: 0
Gen 2 collections: 0

Phew, we are back on a downward trend. We started at 7.5GB and now we are down to 3.2GB.
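The V05 fix can be sketched like this (hypothetical shape; the single StringBuilder becomes a field that is created once and cleared before each use):

```csharp
public sealed class LineParserV05 : ILineParser
{
    // one StringBuilder for the lifetime of the parser,
    // instead of one per section per line
    private readonly StringBuilder _stringBuilder = new StringBuilder();

    private int ParseSectionAsInt(int start, int end, string line)
    {
        _stringBuilder.Clear();

        for (var index = start; index < end; index++)
        {
            _stringBuilder.Append(line[index]);
        }

        return int.Parse(_stringBuilder.ToString());
    }

    // ParseLine and ParseSectionAsDecimal follow the same pattern as V04
}
```

The only allocations left in these helpers are the short strings produced by ToString(), which we will deal with later.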

Lists are not always nice

At this point dotTrace becomes an essential part of the optimisation process. Looking at V05 dotTrace output:-

Building the short-lived index of comma positions is expensive. Underneath, any List<T> is just a standard T[] array; the framework takes care of re-sizing the underlying array as elements are added. This is useful and very handy in typical scenarios. However, we know that there are six sections we need to process (though we are only interested in five of them), ergo there are at least seven commas we want indexes for. We can optimise for that:-

private int[] FindCommasInLine(string line)
{
    var nums = new int[7];
    var counter = 0;

    for (var index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            nums[counter++] = index;
        }
    }

    return nums;
}

V06 statistics:-

Took: 8,047 ms
Allocated: 2,650,318 kb
Peak Working Set: 16,560 kb
Gen 0 collections: 647
Gen 1 collections: 0
Gen 2 collections: 0

2.6GB is pretty good, but what happens if we use byte instead of int throughout this method:-

private byte[] FindCommasInLine(string line)
{
    byte[] nums = new byte[7];
    byte counter = 0;

    for (byte index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            nums[counter++] = index;
        }
    }

    return nums;
}

Re-running V06:-

Took: 8,078 ms
Allocated: 2,454,297 kb
Peak Working Set: 16,548 kb
Gen 0 collections: 599
Gen 1 collections: 0
Gen 2 collections: 0

2.6GB was pretty good; 2.4GB is even better. This is because an int occupies four bytes whereas a byte occupies only one, so each index array is a quarter of the size (note this assumes no line is longer than 255 characters).

Pooling byte arrays

V06 now has a byte[] array that holds the index of each comma for each line. It is a short-lived array, but it is created many times. We can eliminate the cost of creating a new byte[] for each line by using a recent addition to the .NET ecosystem: System.Buffers. Adam Sitnik has a great breakdown on using it and why you should. The important thing to remember when using ArrayPool<T>.Shared is that you must always return the rented buffer after you are done with it, otherwise you will introduce a memory leak into your application.

This is what V07 looks like:-

public void ParseLine(string line)
{
    if (line.StartsWith("MNO"))
    {
        var tempBuffer = _arrayPool.Rent(7);

        try
        {
            var findCommasInLine = FindCommasInLine(line, tempBuffer);
            // truncated for brevity
        }
        finally
        {
            _arrayPool.Return(tempBuffer, true);
        }
    }
}

private byte[] FindCommasInLine(string line, byte[] nums)
{
    byte counter = 0;

    for (byte index = 0; index < line.Length; index++)
    {
        if (line[index] == ',')
        {
            nums[counter++] = index;
        }
    }

    return nums;
}

And V07 has the following statistics:-

Took: 8,891 ms
Allocated: 2,258,272 kb
Peak Working Set: 16,752 kb
Gen 0 collections: 551
Gen 1 collections: 0
Gen 2 collections: 0

Down to 2.2GB, having started at 7.5GB. It is pretty good, but we are not done yet.

Goodbye StringBuilder

Profiling V07 reveals the next problem:-

Calling StringBuilder.ToString() inside the decimal and int parsers is incredibly expensive. It is time to deprecate StringBuilder and write our own¹ int and decimal parsers that work on the characters directly, without relying on strings and calling int.Parse() / decimal.Parse(). According to the profiler this should shave off around 1GB. After writing our own int and decimal parsers, V08 now clocks in at:-

Took: 6,047 ms
Allocated: 1,160,856 kb
Peak Working Set: 16,816 kb
Gen 0 collections: 283
Gen 1 collections: 0
Gen 2 collections: 0

1.1GB is a huge improvement from where we were last (2.2GB) and even better than the baseline (7.5GB).

¹ Code can be found here
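As an illustration of the idea (not the repository's exact code), an int parser that walks the characters directly looks something like this; it avoids both the StringBuilder and the int.Parse() call:

```csharp
// Parses the digits between two comma positions without allocating.
// Illustrative sketch only - no sign, whitespace, or overflow handling.
private int ParseSectionAsInt(int start, int end, string line)
{
    var value = 0;

    for (var index = start; index < end; index++)
    {
        // '0'..'9' are contiguous in UTF-16, so subtracting '0'
        // converts the character to its numeric value
        value = (value * 10) + (line[index] - '0');
    }

    return value;
}
```

The decimal parser follows the same shape, tracking the position of the decimal point and dividing at the end.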

Skipping commas

Until V08 our strategy has been to find the index of every comma on each line, then use that information to create a sub-string which is parsed by calling int.Parse() / decimal.Parse(). V08 deprecates the use of sub-strings but still builds the short-lived index of comma positions.

An alternative strategy would be to skip to the section we are interested in by counting the number of preceding commas then parse anything after the required number of commas and return when we hit the next comma.

We have previously guaranteed that:-

  • Each section is preceded by a comma.
  • And that the location of each section within a line does not change.

This also means we can deprecate the rented byte[] array because we are no longer building a short-lived index:-

public sealed class LineParserV09 : ILineParser
{
    public void ParseLine(string line)
    {
        if (line.StartsWith("MNO"))
        {
            int elementId = ParseSectionAsInt(line, 1); // equal to parts[1] - element id
            int vehicleId = ParseSectionAsInt(line, 2); // equal to parts[2] - vehicle id
            int term = ParseSectionAsInt(line, 3); // equal to parts[3] - term
            int mileage = ParseSectionAsInt(line, 4); // equal to parts[4] - mileage
            decimal value = ParseSectionAsDecimal(line, 5); // equal to parts[5] - value
            var valueHolder = new ValueHolder(elementId, vehicleId, term, mileage, value);
        }
    }
}
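The ParseSectionAsInt(line, sectionIndex) helper could be implemented along these lines (a sketch of the comma-skipping idea; the decimal version works the same way with an extra fractional accumulator):

```csharp
// Skips past `sectionIndex` commas, then parses digits until the
// next comma. Illustrative sketch - no sign or overflow handling.
private int ParseSectionAsInt(string line, int sectionIndex)
{
    var index = 0;

    // skip forward past the required number of preceding commas
    for (var commasSeen = 0; commasSeen < sectionIndex; index++)
    {
        if (line[index] == ',')
        {
            commasSeen++;
        }
    }

    // accumulate digits until the next comma ends the section
    var value = 0;

    while (index < line.Length && line[index] != ',')
    {
        value = (value * 10) + (line[index] - '0');
        index++;
    }

    return value;
}
```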

Unfortunately V09 does not save us any memory; it does, however, reduce the time taken:-

Took: 5,703 ms
Allocated: 1,160,856 kb
Peak Working Set: 16,572 kb
Gen 0 collections: 283
Gen 1 collections: 0
Gen 2 collections: 0

Another benefit of V09 is that it reads much closer to the original implementation.

The war between classes and structs

This blog post is not going to cover the differences or the pros/cons of classes vs structs; that topic has been covered many times. In this particular context, it is beneficial to use a struct. Changing ValueHolder to a struct in V10 gives the following statistics:-

Took: 5,594 ms
Allocated: 768,803 kb
Peak Working Set: 16,512 kb
Gen 0 collections: 187
Gen 1 collections: 0
Gen 2 collections: 0

Finally, we are below the 1GB barrier. A word of warning though: please do not use structs blindly; always test your code and make sure the use case is correct.
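The V10 change itself is tiny: the class keyword becomes struct, so each short-lived value holder lives on the stack of ParseLine instead of the heap (a sketch, using the ValueHolderAsStruct name from the repository):

```csharp
public struct ValueHolderAsStruct
{
    public int ElementId { get; }
    public int VehicleId { get; }
    public int Term { get; }
    public int Mileage { get; }
    public decimal Value { get; }

    public ValueHolderAsStruct(int elementId, int vehicleId, int term,
        int mileage, decimal value)
    {
        ElementId = elementId;
        VehicleId = vehicleId;
        Term = term;
        Mileage = mileage;
        Value = value;
    }
}
```

This works here because the holder is small, short-lived, and never boxed; copying costs and boxing can make structs a pessimisation in other scenarios.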

Goodbye StreamReader

As of V10 the line parser itself is virtually allocation free. dotTrace reveals where the remaining allocations occur:-

Well, this is awkward: the framework itself is costing us memory allocations. We can interact with the file at a lower level than StreamReader:-

private static void ViaRawStream(ILineParser lineParser)
{
    var sb = new StringBuilder();

    using (var reader = File.OpenRead(@"..\..\example-input.csv"))
    {
        try
        {
            bool endOfFile = false;
            while (reader.CanRead)
            {
                sb.Clear();

                while (endOfFile == false)
                {
                    var readByte = reader.ReadByte();

                    // -1 means end of file
                    if (readByte == -1)
                    {
                        endOfFile = true;
                        break;
                    }

                    var character = (char)readByte;

                    // this means the line is about to end so we skip
                    if (character == '\r')
                    {
                        continue;
                    }

                    // this line has ended
                    if (character == '\n')
                    {
                        break;
                    }

                    sb.Append(character);
                }

                if (endOfFile)
                {
                    break;
                }

                var buffer = new char[sb.Length];

                for (int index = 0; index < sb.Length; index++)
                {
                    buffer[index] = sb[index];
                }

                lineParser.ParseLine(buffer);
            }
        }
        catch (Exception exception)
        {
            throw new Exception("File could not be parsed", exception);
        }
    }
}

V11 statistics:-

Took: 5,594 ms
Allocated: 695,545 kb
Peak Working Set: 16,452 kb
Gen 0 collections: 169
Gen 1 collections: 0
Gen 2 collections: 0

Well, 695MB is still better than 768MB. Okay, that was not the improvement I was expecting (and rather anti-climactic). Until we remember that we have seen and solved this problem before: in V07 we used ArrayPool<T>.Shared to prevent lots of small byte[] allocations. We can do the same here:-

private static void ViaRawStream(ILineParser lineParser)
{
    var sb = new StringBuilder();
    var charPool = ArrayPool<char>.Shared;

    using (var reader = File.OpenRead(@"..\..\example-input.csv"))
    {
        try
        {
            bool endOfFile = false;
            while (reader.CanRead)
            {
                // truncated for brevity

                char[] rentedCharBuffer = charPool.Rent(sb.Length);

                try
                {
                    for (int index = 0; index < sb.Length; index++)
                    {
                        rentedCharBuffer[index] = sb[index];
                    }

                    lineParser.ParseLine(rentedCharBuffer);
                }
                finally
                {
                    charPool.Return(rentedCharBuffer, true);
                }
            }
        }
        catch (Exception exception)
        {
            throw new Exception("File could not be parsed", exception);
        }
    }
}

The final version of V11 has the following statistics:-

Took: 6,781 ms
Allocated: 32 kb
Peak Working Set: 12,620 kb
Gen 0 collections: 0
Gen 1 collections: 0
Gen 2 collections: 0

Yes, only 32KB of memory allocations. That is the climax I was looking for.

Find me on Twitter, LinkedIn, or GitHub.

TLDR - Give me a table

| Version | Took (ms) | Allocated (kb) | Peak Working Set (kb) | Gen 0 Collections |
| ------- | --------- | -------------- | --------------------- | ----------------- |
| 01 | 8,750 | 7,412,303 | 16,720 | 1,809 |
| 02 | 6,922 | 4,288,289 | 16,716 | 1,046 |
| 03 | 8,375 | 4,284,873 | 16,744 | 1,046 |
| 04 | 9,813 | 6,727,664 | 16,872 | 1,642 |
| 05 | 8,125 | 3,199,195 | 16,636 | 781 |
| 06 | 8,078 | 2,454,297 | 16,548 | 599 |
| 07 | 8,891 | 2,258,272 | 16,752 | 551 |
| 08 | 6,047 | 1,160,856 | 16,816 | 283 |
| 09 | 5,703 | 1,160,856 | 16,572 | 283 |
| 10 | 5,594 | 768,803 | 16,512 | 187 |
| 11 | 6,781 | 32 | 12,620 | 0 |


Discussion

 

To add to Kasey's comments about using LINQ, this would help you out considerably. Your line parser is basically just a function, right? There's no need to stand up an entire interface for that. It's basically a Func<string, TOut>, where TOut is whatever your desired output type. You're also using methods vs. functions in there. In other words, you have a lot of void return types instead of functions that return the actual values that you want. I would prefer to see something like


var values = 
    File.ReadLines(@"..\..\example-input.csv")
        .Where(x => x.StartsWith("MNO"))
        .Select(LineParser.ParseLine);

That will read in lines lazily, keep only the lines that start with MNO, and then parse your lines out. On top of that, you can parallelize it by sticking a .AsParallel() in before the Select statement, if you're concerned about speed. I'd be curious to see how these affected your processing speed.

 

Hi Jason,

Interesting idea. I will admit that I did not consider LINQ, if I get time I will try your approach (and if you have time feel free to send in a pull request).

Cheers,
Indy

 

Hi Jason, do you have memory consumption using linq?

 

Thanks for the article!

Did you consider processing the file using LINQ methods? They are lazily evaluated, which means there is no memory allocation for data until you try to materialize the results (e.g. with ToArray). During processing, the only memory usage should be for the current iteration. So aside from IEnumerable overhead, the only memory usage should be what is in the file read buffer plus any locals you are using.

Here is a good example. However, rather than using the LINQ syntax, I tend to use the extension methods.

Instead of

var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;

I like this better

var typedSequence = ReadFrom(path)
    .Select(line => ParseLine(line))
    .Where(record => record.Active)
    .Select(record => record.Key)
    ;

It is admittedly a bit uglier with all the extra symbols, but it does not feel as foreign with the rest of my code as the LINQ syntax.

If you debug through this, you will see that one line goes through all of the steps in order. Maybe it does not match and gets stopped at the Where clause. Then the next line is read and goes through the steps.

I find that (once you get used to the syntax) LINQ is much more understandable than solutions using imperative loops. I see Where and I know all it is doing is filtering. I see Select and I know it is only converting the item to a different form. Whereas imperative loops have to be read carefully to make sure of exactly what is being performed.

Your hand-coded optimization at the bottom is clearly going to be superior in resource usage, but the trade-off is that it is harder to understand and maintain. Using LINQ properly can get you a large percentage of the same gains in memory usage (and consequently runtime) while still being pretty expressive.

In any case, thank you for posting this!

 

Hi Kasey,

Thanks for the detailed reply. No I did not consider LINQ.

Using Linq properly can get you a large percentage of the same gains in memory usage (and consequently runtime) while still being pretty expressive.

I would be very interested in any sample code that showed a large reduction in allocations.

I do agree that having a LINQ solution would greatly improve readability.

Glad you enjoyed the article!

Cheers,
Indy

 

I am glad you posted this response detailing the usage of LINQ. When I started reading this article, I was thinking that it would be some interesting LINQ wizardry. However, it is always nice to see optimizations in any form.

 

"no memory allocation ... until you materialize the results"

In this case, aren't you going to perform exactly the same allocations, then?

Unless... LINQ designers may have gone through the same optimization process @Indy has :) and reused buffers all along.

 

The point I was making there was that LINQ methods do not load or run anything when they are declared (a common mistaken assumption); they only "do something" when they are enumerated.

The main issue from the original solution was run-away memory usage, because all the rows are loaded into memory at once. Then a new intermediate object is allocated at each step for every row. So memory usage is roughly: number of rows * (row size + intermediate objects sizes)

Using LINQ as I mentioned, only one row is fully processed at a time before fetching the next row. So at most you have one set of allocations for the row and each intermediate object. So memory usage is roughly: row size + intermediate objects sizes.

Any solution processing files would probably also do well to buffer the results to an IO output, to avoid collecting large result sets in memory.

If Garbage Collector performance is an issue, that can be optimized separately. Common strategies are: value types (allocated on stack frame and copied when passed in or returned to other stack frames), or a pre-allocated object pool, or if you need the same consistent set of objects for each row, then a set of singleton objects is equivalent to an object pool of size 1... just remember to reset them between iterations.

 

TLDR version:
Don’t use string manipulations to handle a proper parser’s workload.

Writing parsers that work on byte/char streams will always outperform other solutions. It takes more time up front, but less in the total run.

 

What parser libraries would you use?

 

Nice walk through of optimising for memory allocations / GC avoidance, using the right tools (profiler measurements), thanks :) I have done similar for heavy lifting code in Java (I still get the flashbacks!)

From your table it looks like all that work shaved 2secs off the run time (~23%) and 4MB off the peak RAM usage (~25%), which may be worth the impacts to the code in your case but not for every case - again a nice example to think about when and where to optimise, or if there are design changes that would help: back to that profiler... or change language!

 

Hi Phil,

Thanks for the positive comments.

With regards to time taken and peak working set, that was not our main focus. Our main focus was to reduce the number of allocations and reduce the time spent in GC. Unfortunately this data import process is not a standalone one time application. It is a part of a much larger application which is core to our entire platform.

You are right about the trade off between performance and readability. It is always a fine line!

Cheers,
Indy

 

Hi Indy, you're welcome - I wondered if you might have been addressing a wider issue with the GC, hence my slightly facetious comment about changing language at the end (ie: might as well write in C!). I certainly considered abandoning my Java code a few times but was greatly helped by MapDB and other off-heap data handling libs, which allowed me to stay in the target ecosystem (I was working on a large scale Java static analysis package at the time).

 

I might have missed this in the other comments, but I don't understand why you've written your own parsing function. .Net has objects specifically for reading in delimited files. It even has the ability to handle malformed lines instead of rejecting the entire file.

The other thing I was curious about... Unless you are doing additional manipulation of the data before writing it to the database, and assuming you're using MS SQL Server, why not just read the file (using msdn.microsoft.com/en-us/library/m...) and then use the SQL Bulk Insert object (blogs.msdn.microsoft.com/nikhilsi/...)? It's fast and efficient.

 

I think the method used is quite bad and cumbersome, reinventing the wheel (and it's all because the data source isn't properly designed?). Plus, a misleading title.

 

So, not sure why, but I got an email linking to this article. You've done a good job of showing how to speed up your C#. But honestly, the real answer here is to use a CSV parsing library. Every language has one. You might also want to branch out into other languages. Perl, for example, is literally your super-optimized solution with just this fairly obvious and readable code:

gist.github.com/MattOates/718cea59...

 

Thank you for posting!
Very interesting.

One thing I don't understand is why you used StringBuilder in ParseSectionAs* in the first place.
Wouldn't it have been easier and more efficient to just call Substring?

 

Hi Vadim,

I suspect, although I have not confirmed, that using string.Substring instead of StringBuilder would have worse characteristics. Hopefully I will get time to write a follow-up article with everyone's ideas and suggestions.

In the meantime feel free to have a go yourself. A couple of people have already sent pull requests.

Regards,
Indy

 

I was a little disappointed I didn't see you drop into MemoryMappedFile for speed and use stackalloc byte[] to skip the garbage collector on byte arrays used locally (you didn't say anything about avoiding the 'unsafe' keyword).

 

ArrayPool is a very interesting option. Unlike your use case I have no control over the form of my incoming strings (and they're usually horrid trash). I do a bajillion manipulations on them including a lot of splitting, substringing, replacements, converting to List, and StringBuilding. I'm going to study what all I'm doing to see if I can use the power of the ArrayPool to replace some of my admittedly brutal (and as-yet untuned-for-performance) methods.

 

Hi Mardoch,

It would be great to see what you can achieve using the things I learned in the article!

Look forward to your results!

Cheers,
Indy

 

I feel like the first thing I would have done is pass around the reader itself.

public void ParseLine(TextReader reader) {
    try {
        if (ValidateSection("MNO", reader)) {
            int elementId = ParseIntSection(reader);
            int vehicleId = ParseIntSection(reader);
            int term = ParseIntSection(reader);
            int mileage = ParseIntSection(reader);
            decimal value = ParseDecimalSection(reader);
            var valueHolder = new ValueHolder(elementId, vehicleId, term, mileage, value);
        }
    } finally {
        ReadToNewline(reader);
    }
}
 

In response to the comment:

> Hi Steve,
>
> Glad you enjoyed it! I agree that sometimes the easy route is taken by default. But this particular piece of code was around before I joined the company. Sometimes it is better to choose the easy route and then improve the code if it becomes a problem.
>
> Regards,
> Indy

Remember, "premature optimization is the root of all evil", a famous quote from Sir Tony Hoare, popularized by Donald Knuth. We should strive to make our code readable first and foremost. If, after running it, the result is deemed unsatisfactory, only then should one profile the code and attack the worst culprit; run again, and if that works sufficiently well, stop optimizing.

This approach is evidenced in the article, but the comment softened the message. I have dealt with a lot of code where someone made a function that ran one time upon initialization so baroque that it was nearly impossible to determine what the goal was.

 

The whole "premature optimization is evil" thing is a little played out. I like to use "common sense" optimization, e.g. using a HashSet for lookups, streaming file/DB results vs. loading the entire set into memory, etc.

Readable code is super important. Memory-efficient and optimized code is important if you are running in AWS Lambda with thousands of invocations.

 

Hi, I noticed in your code that you are creating ValueHolder/ValueHolderAsStruct but you don't store it in a list.
If that line is uncommented (in V11), so that ValueHolderAsStruct is stored in a list, then memory allocation goes up to around 1GB.
What is interesting is that if I then replace ValueHolderAsStruct with the ValueHolder class, allocated memory actually goes down by almost half, to 523MB.
It seems that for some reason it is actually "cheaper" to hold a list of classes than a list of structs. Do you know why this may be?

 

Going down the revisions, your parser looks more and more like a good ol' C implementation. Now, forgo .NET strings, get \0-terminated char*'s, and we're back home :) All we're trying to say here is that .NET is not very good at simple data processing...

 

Shouldn't the compiler do this for you?
You basically wrote the code for parsing a CSV with fixed types.
All the allocation patterns were straightforward, so the compiler should in theory be able to create reusable buffers and do away with most allocations.

 

7.5GB to 32KB is very misleading. It is a measurement problem to me.

Write 3GB to the same array position and it's not counted. Write 3GB to sequential objects and it starts getting counted as "allocation".

Anyway, the file is 3.7GB and at least 3.7GB has been read into memory. In the case of array index replacement it doesn't come up in the statistics, because you are using only object allocation as the measurement.

It's a typical accounting issue... If you sell to your subsidiary, is it counted as sales?

 

Awesome post.
I really enjoyed reading it. You've achieved a big reduction in memory usage, so your mission has been accomplished. I'm quite sure that I'll use the techniques you presented.

I wish you continued success in your work, and I look forward to your next articles.

Cheers.

 

Very interesting article!
It inspired me to make my own attempt at this. My focus was on reducing execution time, and I reduced it by 60-70% (at least on my machine :) ).
Please take a look at:
qoutify@bitbucket.org/qoutify/evil...

Grateful for any comments!

 

Great job refactoring. This was fun to read. I would be interested to see how LINQ performs in this context.

 
 

Hi Artyom,

Glad you enjoyed the article!

Cheers,
Indy

 
 

I'd like to see a benchmark using Regular Expressions on this problem.

 

Great Article! Really got me thinking about how we often take the easy route by default, and maybe there are some simple general optimisations that are drop in replacements.

 

Hi Steve,

Glad you enjoyed it! I agree that sometimes the easy route is taken by default. But this particular piece of code was around before I joined the company. Sometimes it is better to choose the easy route and then improve the code if it becomes a problem.

Regards,
Indy

 

Hi Robb,

Thanks for the positive comments; sorry to disappoint on the stringly-typed data.

Cheers,
Indy