DEV Community


Strings Are Evil

Indy Singh on June 14, 2018

Reducing memory allocations from 7.5GB to 32KB

Jason Steinhauser

To add to Kasey's comments about using LINQ, this would help you out considerably. Your line parser is basically just a function, right? There's no need to stand up an entire interface for that; it's basically a Func<string, TOut>, where TOut is whatever your desired output type is. You're also writing methods rather than functions in there; in other words, you have a lot of void return types instead of functions that return the actual values you want. I would prefer to see something like


var values = 
    File.ReadLines(@"..\..\example-input.csv")
        .Where(x => x.StartsWith("MNO"))
        .Select(LineParser.ParseLine);

That will read the lines in lazily, keep only the lines that start with MNO, and then parse them. On top of that, you can parallelize it by sticking an .AsParallel() in before the Select statement if you're concerned about speed. I'd be curious to see how these changes affect your processing speed.
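
For example, the parallel version might look something like this (untested, and it assumes ParseLine has been changed to return the parsed record rather than void):

var values =
    File.ReadLines(@"..\..\example-input.csv")
        .Where(x => x.StartsWith("MNO"))
        .AsParallel()                    // note: result order is not guaranteed without AsOrdered()
        .Select(LineParser.ParseLine)    // assumes ParseLine : string -> parsed record
        .ToList();                       // materialise here, or stream the results onwards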

Indy Singh

Hi Jason,

Interesting idea. I will admit that I did not consider LINQ, if I get time I will try your approach (and if you have time feel free to send in a pull request).

Cheers,
Indy

miguel miranda

Hi Jason, do you have memory consumption figures when using LINQ?

Kasey Speakman • Edited

Thanks for the article!

Did you consider processing the file using LINQ methods? They are lazily evaluated, which means there is no memory allocation for data until you try to materialize the results (e.g. with ToArray). And during processing, the only memory usage should be for the current iteration. So aside from IEnumerable overhead, the only memory usage should be what is in the file read buffer plus any locals you are using.

Here is a good example. However, rather than using the LINQ syntax, I tend to use the extension methods.

Instead of

var typedSequence = from line in ReadFrom(path)
                    let record = ParseLine(line)
                    where record.Active // for example
                    select record.Key;

I like this better

var typedSequence = ReadFrom(path)
    .Select(line => ParseLine(line))
    .Where(record => record.Active)
    .Select(record => record.Key)
    ;

It is admittedly a bit uglier with all the extra symbols, but it does not feel as foreign alongside the rest of my code as the LINQ query syntax does.

If you debug through this, you will see that one line goes through all of the steps in order. Maybe it does not match and gets stopped at the Where clause. Then the next line is read and goes through the steps.
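
(For context, ReadFrom here is the usual lazy iterator built with yield return -- a sketch, not necessarily what that example uses:)

// Lazy line reader; needs System.IO and System.Collections.Generic.
// yield return hands back one line at a time, so only the current
// line (plus the reader's buffer) is ever held in memory.
static IEnumerable<string> ReadFrom(string path)
{
    using (var reader = new StreamReader(path))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}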

I find that (once you get used to the syntax) LINQ is much more understandable than solutions using imperative loops. I see Where and I know that all it is doing is filtering. I see Select and I know it is only converting the item to a different form. Imperative loops, by contrast, have to be read carefully to work out exactly what is being performed.

Your hand-coded optimization at the bottom is clearly going to be superior in resource usage, but the trade-off is that it is harder to understand and maintain. Using LINQ properly can get you a large percentage of the same gains in memory usage (and consequently runtime) while still being pretty expressive.

In any case, thank you for posting this!

Indy Singh

Hi Kasey,

Thanks for the detailed reply. No I did not consider LINQ.

> Using LINQ properly can get you a large percentage of the same gains in memory usage (and consequently runtime) while still being pretty expressive.

I would be very interested in any sample code that showed a large reduction in allocations.

I do agree that having a LINQ solution would greatly improve readability.

Glad you enjoyed the article!

Cheers,
Indy

Doug Reynolds

I am glad you posted this response detailing the usage of LINQ. When I started reading this article, I was thinking that it would be some interesting LINQ wizardry. However, it is always nice to see optimizations in any form.

xtofl

"no memory allocation ... until you materialize the results"

In this case, aren't you going to perform exactly the same allocations, then?

Unless... the LINQ designers went through the same optimization process @Indy has :) and reused buffers all along.

Kasey Speakman

The point I was making there was that LINQ methods do not load or run anything when they are declared (a common mistaken assumption) -- they only "do something" when they are enumerated.

The main issue with the original solution was runaway memory usage, because all the rows are loaded into memory at once. Then a new intermediate object is allocated at each step for every row. So memory usage is roughly: number of rows * (row size + intermediate object sizes).

Using LINQ as I described, only one row is fully processed at a time before the next row is fetched. So at most you have one set of allocations for the row and each intermediate object. So memory usage is roughly: row size + intermediate object sizes.

Any solution processing files would probably also do well to buffer the results to an IO output, to avoid collecting large result sets in memory.

If garbage collector performance is an issue, that can be optimized separately. Common strategies are: value types (allocated on the stack frame and copied when passed to or returned from other stack frames), a pre-allocated object pool, or, if you need the same consistent set of objects for each row, a set of singleton objects, which is equivalent to an object pool of size 1... just remember to reset them between iterations.
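
To illustrate that last point, a reusable scratch object might look something like this (hypothetical, not from the article):

// Hypothetical single reusable record, reset between rows instead of
// allocating a new object per row. Field names are made up.
class RowScratch
{
    public int ElementId;
    public int VehicleId;
    public decimal Value;

    public void Reset()
    {
        ElementId = 0;
        VehicleId = 0;
        Value = 0m;
    }
}

// usage sketch:
// var scratch = new RowScratch();
// foreach (var line in ReadFrom(path)) { scratch.Reset(); /* fill scratch, then consume it */ }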

Michiel Hendriks

TLDR version:
Don’t use string manipulations to handle a proper parser’s workload.

Writing parsers that work on byte/char streams will always outperform other solutions. It takes more time up front, but less over the total run.
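
For instance, an integer field can be read straight off the stream without ever materialising a string (rough sketch; it assumes well-formed, non-negative, comma-terminated input):

// Accumulates digits from a TextReader (System.IO) until the delimiter;
// no intermediate string or StringBuilder is allocated.
static int ReadIntField(TextReader reader, char delimiter = ',')
{
    int value = 0;
    int c;
    while ((c = reader.Read()) != -1 && c != delimiter)
    {
        value = (value * 10) + (c - '0');
    }
    return value;
}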

xtofl

What parser libraries would you use?

iplaykeys

I might have missed this in the other comments, but I don't understand why you've written your own parsing function. .NET has classes specifically for reading delimited files, and they can even handle malformed lines instead of rejecting the entire file.

The other thing I was curious about: unless you are doing additional manipulation of the data before writing it to the database, and assuming you're using MS SQL Server, why not just read the file (using msdn.microsoft.com/en-us/library/m...) and then use the SQL bulk insert object (blogs.msdn.microsoft.com/nikhilsi/...)? It's fast and efficient.
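
Roughly this kind of thing (sketch only; the column layout, table name, and connection string below are made up):

// TextFieldParser (Microsoft.VisualBasic.FileIO) does the delimited parsing
// and copes with malformed lines; SqlBulkCopy (System.Data.SqlClient) streams
// the rows into SQL Server via a DataTable (System.Data).
using (var parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(@"..\..\example-input.csv"))
{
    parser.TextFieldType = Microsoft.VisualBasic.FileIO.FieldType.Delimited;
    parser.SetDelimiters(",");

    var table = new DataTable();
    table.Columns.Add("ElementId", typeof(int));
    table.Columns.Add("VehicleId", typeof(int));

    while (!parser.EndOfData)
    {
        try
        {
            string[] fields = parser.ReadFields();
            if (fields[0] == "MNO")
            {
                table.Rows.Add(int.Parse(fields[1]), int.Parse(fields[2]));
            }
        }
        catch (Microsoft.VisualBasic.FileIO.MalformedLineException)
        {
            // skip the bad line instead of rejecting the whole file
        }
    }

    using (var bulk = new SqlBulkCopy("<connection string>"))
    {
        bulk.DestinationTableName = "dbo.ImportedValues";
        bulk.WriteToServer(table);
    }
}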

wplj

I think the method used is quite bad, cumbersome, reinventing the wheel (and it's all because the data source isn't properly designed?).
Plus the title is misleading.

Phil Ashby

Nice walk-through of optimising for memory allocations / GC avoidance using the right tools (profiler measurements), thanks :) I have done similar work for heavy-lifting code in Java (I still get the flashbacks!)

From your table it looks like all that work shaved 2 seconds off the run time (~23%) and 4MB off the peak RAM usage (~25%), which may be worth the impact on the code in your case, but not in every case - again, a nice example for thinking about when and where to optimise, or whether there are design changes that would help: back to that profiler... or change language!

Indy Singh

Hi Phil,

Thanks for the positive comments.

With regards to time taken and peak working set, that was not our main focus. Our main focus was to reduce the number of allocations and the time spent in GC. Unfortunately this data import process is not a standalone one-time application; it is part of a much larger application which is core to our entire platform.

You are right about the trade-off between performance and readability. It is always a fine line!

Cheers,
Indy

Phil Ashby

Hi Indy, you're welcome - I wondered if you might have been addressing a wider issue with the GC, hence my slightly facetious comment about changing language at the end (ie: might as well write in C!). I certainly considered abandoning my Java code a few times but was greatly helped by MapDB and other off-heap data handling libs, which allowed me to stay in the target ecosystem (I was working on a large scale Java static analysis package at the time).

Vadim

Thank you for posting!
Very interesting.

One thing I don't understand is why you used StringBuilder in ParseSectionAs* in the first place.
Wouldn't it have been easier and more efficient to just call Substring?
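
Something like this is what I mean (rough sketch; in practice the start and length would come from scanning for the commas):

// Slice the section out in one call instead of copying it
// character-by-character into a StringBuilder.
static int ParseSectionAsInt(string line, int start, int length)
{
    return int.Parse(line.Substring(start, length));
}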

Indy Singh

Hi Vadim,

I suspect, although I have not confirmed, that using string.Substring instead of StringBuilder would have worse characteristics. Hopefully I will get time to write a follow-up article with everyone's ideas and suggestions.

In the meantime, feel free to have a go yourself. A couple of people have already sent in pull requests.

Regards,
Indy

Eric Wilson • Edited

In response to the comment:

> Hi Steve,
>
> Glad you enjoyed it! I agree that sometimes the easy route is taken by default. But this particular piece of code was around before I joined the company. Sometimes it is better to choose the easy route and then improve the code if it becomes a problem.
>
> Regards,
> Indy

Remember, "premature optimization is the root of all evil." This famous quote by Sir Tony Hoare (popularized by Donald Knuth). We should strive to make our code readable first and foremost. If, after running and it is deemed to be unsatisfactory, at that point only one should profile the code and attack the worst culprit, run again and if that works sufficiently well, then stop optimizing.

This approach is evidenced in the article, but the comment softened the message. I have dealt with a lot of code where someone made a function that ran only once at initialization so baroque that it was nearly impossible to determine what its goal was.

Doug Reynolds

The whole "Premature optimization is evil" is little played out. I like to use "common sense" optimization, eg using a HashSet for lookups, stream file/dB results vs. loading the entire set into memory, etc.
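
E.g. the kind of thing I mean by common-sense optimization (trivial sketch; the codes and path are made up):

// O(1) membership checks instead of scanning a list on every lookup.
var path = @"..\..\example-input.csv";
var wantedCodes = new HashSet<string> { "MNO", "XYZ" };
foreach (var line in File.ReadLines(path))
{
    if (wantedCodes.Contains(line.Substring(0, 3)))
    {
        // process only the record types we care about
    }
}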

Readable code is super important. Memory-efficient, optimized code is important if you are running in AWS Lambda with thousands of invocations.

Matt Oates • Edited

Not sure why, but I got an email linking to this article. You've done a good job of showing how to speed up your C#, but honestly the real answer here is to use a CSV parsing library. Every language has one. You might also want to branch out into other languages. Perl, for example, is literally your super-optimized solution with just this fairly obvious and readable code:

gist.github.com/MattOates/718cea59...

Jim Przybylinski • Edited

I feel like the first thing I would have done is pass around the reader itself.

// Inside LineParser:
public static ValueHolder ParseLine(TextReader reader) {
    try {
        if (ValidateSection("MNO", reader)) {
            int elementId = ParseIntSection(reader);
            int vehicleId = ParseIntSection(reader);
            int term = ParseIntSection(reader);
            int mileage = ParseIntSection(reader);
            decimal value = ParseDecimalSection(reader);
            return new ValueHolder(elementId, vehicleId, term, mileage, value);
        }
        return null;
    } finally {
        ReadToNewline(reader);
    }
}

Michael Heyman

I was a little disappointed I didn't see you drop into MemoryMappedFile for speed and use stackalloc byte[] to skip the garbage collector on byte arrays used locally (you didn't say anything about avoiding the 'unsafe' keyword).
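
Something along these lines, for example (sketch only; it uses System.IO.MemoryMappedFiles and System.IO, and the buffer size is invented):

// Map the file instead of streaming it, and use stackalloc for a small
// scratch buffer that never touches the GC heap (requires /unsafe).
using (var mmf = MemoryMappedFile.CreateFromFile(@"..\..\example-input.csv", FileMode.Open))
using (var accessor = mmf.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
{
    unsafe
    {
        byte* scratch = stackalloc byte[128];
        for (long i = 0; i < 128 && i < accessor.Capacity; i++)
        {
            scratch[i] = accessor.ReadByte(i); // parse in place from here...
        }
    }
}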

Mardoch

ArrayPool is a very interesting option. Unlike your use case I have no control over the form of my incoming strings (and they're usually horrid trash). I do a bajillion manipulations on them including a lot of splitting, substringing, replacements, converting to List, and StringBuilding. I'm going to study what all I'm doing to see if I can use the power of the ArrayPool to replace some of my admittedly brutal (and as-yet untuned-for-performance) methods.

Indy Singh

Hi Mardoch,

It would be great to see what you can achieve using the things I learned in the article!

Look forward to your results!

Cheers,
Indy

Ashutosh Vaidya

Oh, how I wish I had discovered this earlier. This is pretty similar to a situation at my previous job: a financial domain, bulk imports, and huge record sets that needed to be parsed after business hours. I would have loved to see whether some or all of the advice in this post could have been applied there.

Nevertheless, this is a great post to keep in your collection for future use. Thanks for the detailed explanation and for showing all the statistics.

discotord

Very interesting article!
It inspired me to make my own attempt at this. My focus was on reducing execution time, and I cut it by 60-70% (at least on my machine :) ).
Please take a look at:
qoutify@bitbucket.org/qoutify/evil...

Grateful for any comments!

vigzel

Hi, I noticed in your code that you are creating ValueHolder/ValueHolderAsStruct but you don't store it in a list.
If that line is uncommented (in v11), so that ValueHolderAsStruct is stored in a list, then memory allocation goes up to around 1 GB.
What is interesting is that if I then replace ValueHolderAsStruct with the ValueHolder class, allocated memory actually goes down by almost half, to 523MB.
It seems that for some reason it's actually "cheaper" to hold a list of classes than a list of structs. Do you know why this might be?

Theodore Ravindranath • Edited

7.5GB to 32KB is very misleading. To me it is a measurement problem.

Write 3GB to the same array position and it's not counted. Write 3GB to a sequence of new objects and it starts getting counted as "allocation".

Anyway, the file is 3.7GB, and at least 3.7GB has been read into memory. In the case of array index replacement it simply doesn't show up in the statistics... because you are using only object allocation as the measurement.

It's a typical accounting issue... if you sell to your own subsidiary, is it counted as sales?

Rafal Pienkowski

Awesome post.
I really enjoyed reading it. You've achieved a big reduction in memory usage, so mission accomplished. I'm quite sure I'll be using the techniques you presented.

I wish you continued success in your work, and I look forward to your next articles.

Cheers.

Gobol

Going down the revisions, your parser looks more and more like a good ol' C implementation. Now forgo .NET strings, grab \0-terminated char*'s, and we're back home :) All we're trying to say here is that .NET is not very good at simple data processing...

Michał J. Gajda

Shouldn't the compiler do this for you?
You basically wrote the code for parsing a CSV with fixed types.
All the allocation patterns were straightforward, so the compiler should in theory be able to create reusable buffers and do away with most of the allocations.

Artyom Yakovenko

Awesome walkthrough! Thanks!

Indy Singh

Hi Artyom,

Glad you enjoyed the article!

Cheers,
Indy

Hasheen DeBerry

Great job refactoring. This was fun to read. I would be interested to see how LINQ performs in this context.

Steve Futcher

Great article! It really got me thinking about how we often take the easy route by default, and whether there are some simple, general optimisations that are drop-in replacements.

Indy Singh

Hi Steve,

Glad you enjoyed it! I agree that sometimes the easy route is taken by default. But this particular piece of code was around before I joined the company. Sometimes it is better to choose the easy route and then improve the code if it becomes a problem.

Regards,
Indy

Indy Singh

Hi Robb,

Thanks for the positive comments, and sorry to disappoint on the stringly typed data.

Cheers,
Indy

codadensys

Great read. Thank you.