Deep Into Auditing Stringier

Patrick Kelly (entomy) ・6 min read

As part of any major release process, I do full audits of the code base. This was something I learned while contributing to OpenBSD and similar projects, where code quality and security are taken very seriously. So unsurprisingly, as I'm rolling out v4.0 of the Stringier libraries, I've been doing an audit. Since I don't have Core done yet, I'm not ready to talk about the specific functions inside of it. So, apologies for the delays, but there's actually a lot involved in this audit. And that is what we'll be going over, because this is an opportunity to share some useful general optimization knowledge, and voice some criticisms of .NET. But in order to do this, we first need to talk about why this project exists on .NET at all.

Interning is a very powerful optimization. I'll let you read straight from the article, but the general idea is that you're greatly reducing memory pressure, and speeding up common string operations like equality, which is used extensively.
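To make the idea concrete, here is a minimal sketch of interning using the standard `String.Intern` call. The class and method names (`InternSketch`, `Compare`) are made up for illustration and are not part of Stringier.

```csharp
using System;

public static class InternSketch
{
    // Builds two equal strings at run time, so the compiler cannot
    // merge them the way it merges identical literals.
    public static (bool Before, bool After) Compare()
    {
        string a = new string(new[] { 's', 'p', 'a', 'n' });
        string b = new string(new[] { 's', 'p', 'a', 'n' });

        // Equal contents, but two separate objects on the heap.
        bool before = object.ReferenceEquals(a, b);

        // String.Intern returns the pooled instance for that content,
        // so both calls hand back the same reference.
        bool after = object.ReferenceEquals(string.Intern(a), string.Intern(b));

        return (before, after);
    }
}
```

Once two strings are known to be interned, an equality check can stop at the reference comparison instead of walking every character, which is where the speedup comes from.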

The CTS and CLS have additional benefits. Sure, we could always bind to a specific language's conventions, and although this is normally only done with C, there are languages out there that support binding to multiple languages, some of them rather complex. But if the mechanism to do this is standard, that's even better.

So that's why I went with .NET for this project. While there are other things that offer cross-language compatibility, they are often operating-system specific, like the Common Language Environment on OpenVMS or Component Object Model on Windows, or are commercial-compiler specific, like Elements from RemObjects. So if I'm going to support this kind of thing, and I want to, I either need to roll my own mechanism, do the Ada thing with standardizing import conventions for multiple different languages, or reuse an existing mechanism. Writing on top of .NET just made sense.

So what is this project, exactly? Simply put, Stringier intends to be the most full-featured and comprehensive text processing API in the world. First and foremost, it intends to be the most user-friendly and correct. Performance is a very strong goal as well. But it also happens to intend to be a large part of another language's standard library, and even runtime library. More on that in a later post, but this is part of why cross-language compatibility is so important.

I mentioned changes. There are a lot of changes coming, hence the major version bump. While working on the Patterns engine, a major source of optimization was reducing memory allocations as much as possible. In fact, parsing doesn't involve a single memory allocation anywhere on the heap, and barely anything on the stack (just a single reference slice as a ReadOnlySpan). It might be the world's first allocationless parser framework. But this specific optimization isn't unique to parsing. Remember the whole point of interning? Well, reference slicing is another way to accomplish that goal. It only works with contiguous memory, like arrays, but reference slicing is actually more effective than interning in that situation. See, Core was originally constructing new strings all over the place, even when it didn't need to. Many operations actually have their entire result exist within the source text. Trim() is a fantastic example of this. No string interning system can avoid allocating a new string for the result. It might be interned, but it's still duplicated. Through reference slicing we can reduce allocations even further. And because allocation onto the heap, a check against the intern pool, and possible internment are all avoided, the actual functions are faster.
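As a rough illustration of reference slicing, here is how a trim can return a view into the source instead of a new string. `TrimSlice` is a hypothetical helper written for this example, not Stringier's implementation; .NET's own `MemoryExtensions.Trim` offers the same kind of operation on spans.

```csharp
using System;

public static class SliceSketch
{
    // Hypothetical helper: trims whitespace by narrowing a view over
    // the source text instead of building a new string.
    public static ReadOnlySpan<char> TrimSlice(ReadOnlySpan<char> text)
    {
        int start = 0;
        int end = text.Length;

        // Walk inward from both ends past any whitespace.
        while (start < end && char.IsWhiteSpace(text[start])) start++;
        while (end > start && char.IsWhiteSpace(text[end - 1])) end--;

        // Slice only adjusts a reference and a length; no characters are copied.
        return text.Slice(start, end - start);
    }
}
```

Calling `SliceSketch.TrimSlice("  hello  ".AsSpan())` yields a span over the five characters of `hello` that still live inside the original string, so nothing is allocated unless the caller later asks for `ToString()`.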

Core is using a new public API design, whereby ReadOnlySpan<Char> is the return type used for every function that does not allocate, and String is used by every function that does allocate. This wouldn't matter much, as ReadOnlySpan<Char> could be used exclusively without changing the downstream dev's experience, but it does serve as nice documentation about whether allocations are occurring or not, baked right into the function signature. To go along with this, every function's implementation is required to live in a method whose text-type parameters are exclusively ReadOnlySpan<Char>. But, because any text-type can be rapidly and efficiently converted to a ReadOnlySpan<Char>, overloads are provided which work on other text-types, including String, Char[], and even fat pointers (Char*, Int32)! This isn't any additional burden on my end, because like I said, they all share the same implementation.
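A sketch of what that convention could look like is below. The function names (`Before`, `Repeat`) are invented for illustration and are not taken from Core, and the fat-pointer overload is left out so the example compiles without `unsafe`.

```csharp
using System;
using System.Text;

public static class OverloadSketch
{
    // Shared implementation: the only text-type parameter is ReadOnlySpan<char>.
    // Returning a span signals that nothing is allocated.
    public static ReadOnlySpan<char> Before(ReadOnlySpan<char> source, char delimiter)
    {
        int index = source.IndexOf(delimiter);
        return index < 0 ? source : source.Slice(0, index);
    }

    // Thin overloads convert to a span and forward to the shared implementation.
    public static ReadOnlySpan<char> Before(string source, char delimiter) =>
        Before(source.AsSpan(), delimiter);

    public static ReadOnlySpan<char> Before(char[] source, char delimiter) =>
        Before(new ReadOnlySpan<char>(source), delimiter);

    // Returning String signals that this function has to allocate its result.
    public static string Repeat(ReadOnlySpan<char> source, int count)
    {
        var builder = new StringBuilder(source.Length * Math.Max(count, 0));
        for (int i = 0; i < count; i++) builder.Append(source);
        return builder.ToString();
    }

    public static string Repeat(string source, int count) =>
        Repeat(source.AsSpan(), count);
}
```

The forwarding overloads cost one cheap conversion each, so adding another text-type later means adding one more one-line overload rather than another implementation.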

There were numerous operations where StringComparison.CurrentCulture was implicitly used and it shouldn't have been. This is unintended behavior and therefore a bug. While I could handle it the .NET way, where I still do the implicit current-culture comparison but provide overloads that take a specified culture, this is wrong, in my opinion, and only left in because Microsoft loves backwards compatibility so strongly that it replicates bugs just because 0.0001% of developers remember them existing in something a given product was inspired by. Remember what I said the biggest product goals were? Correctness. What I do need to figure out, however, is whether the default behavior should be ordinal or invariant-culture. But I will have an answer for that come the v4 release.
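For a sense of what "no implicit current culture" could look like in a signature, here is a minimal sketch. `StartsWithText` is a hypothetical name, and the ordinal default is an assumption made only so the example is concrete; the choice between ordinal and invariant-culture is still open, as noted above.

```csharp
using System;

public static class ComparisonSketch
{
    // Hypothetical default: Ordinal is assumed here purely for illustration.
    // The result never depends on the culture of the running thread.
    public static bool StartsWithText(ReadOnlySpan<char> source, ReadOnlySpan<char> value) =>
        StartsWithText(source, value, StringComparison.Ordinal);

    // Any other comparison, including culture-sensitive ones, is opt-in.
    public static bool StartsWithText(ReadOnlySpan<char> source, ReadOnlySpan<char> value, StringComparison comparison) =>
        source.StartsWith(value, comparison);
}
```

Compare this with `String.StartsWith(String)` in .NET, which performs a culture-sensitive comparison using the current culture unless a `StringComparison` is passed.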

It shouldn't be surprising then, that there are numerous issues with culture, string comparison, etc. going on. There's all sorts of gradual progress in fixing those, and it's a long game that won't be finished in this release. But there's been enough implemented that I can actually start to introduce some of that auxiliary work directly into Core. UnicodeCategory should have had Flags. This has been introduced as Categories and is used to implement the relevant functionality. This is particularly important in that the old way of specifying multiple categories, or broad categories, was tedious and required the use of a case statement. Now you can just bitwise-or them together. In fact, that's how the broad categories are done. No more "is this a spacing mark, or a non-spacing mark...?"; it's just "is this a mark?". Because of how Flags works, you can arbitrarily combine them as well. This actually simplifies many algorithms, and is considerably more useful than you'd expect. It also sets up further improvements to this concept down the road, through far more granular categorization, possibly based on an updated UTN#36. Speaking of categories, almost every function accepts categories as a parameter, for things like trimming anything of a particular category, checking whether any character in a string is of a category, etc.
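Here is a rough sketch of the flags idea, assuming each flag sits at the bit position of the matching `UnicodeCategory` value. `CategoriesSketch` and `IsAny` are hypothetical names covering only a handful of categories; the real Categories type may be laid out differently.

```csharp
using System;
using System.Globalization;

// Hypothetical subset for illustration; not Stringier's actual definition.
// Each flag sits at the bit position of the matching UnicodeCategory value.
[Flags]
public enum CategoriesSketch : uint
{
    None = 0,
    UppercaseLetter = 1u << (int)UnicodeCategory.UppercaseLetter,
    LowercaseLetter = 1u << (int)UnicodeCategory.LowercaseLetter,
    NonSpacingMark = 1u << (int)UnicodeCategory.NonSpacingMark,
    SpacingCombiningMark = 1u << (int)UnicodeCategory.SpacingCombiningMark,
    EnclosingMark = 1u << (int)UnicodeCategory.EnclosingMark,
    SpaceSeparator = 1u << (int)UnicodeCategory.SpaceSeparator,

    // A broad category is just the bitwise-or of the narrow ones.
    Mark = NonSpacingMark | SpacingCombiningMark | EnclosingMark,
}

public static class CategorySketch
{
    // True when the character's category is any of the requested flags.
    public static bool IsAny(char value, CategoriesSketch categories)
    {
        uint bit = 1u << (int)char.GetUnicodeCategory(value);
        return ((uint)categories & bit) != 0;
    }
}
```

With this shape, `CategorySketch.IsAny(c, CategoriesSketch.Mark)` answers "is this a mark?" in a single test, and ad hoc combinations such as `CategoriesSketch.UppercaseLetter | CategoriesSketch.SpaceSeparator` need no case statement.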

LINQ is a fantastic thing. Now, let's nip this in the bud right now: I'm not going to be magically adding LINQ-expression syntax to C#; I can't without adding particular interfaces to .NET types. But what I can do is provide overloads of the existing LINQ extension methods, which can be used through fluent syntax. In fact, some of these already existed, but because I wasn't ever thinking about LINQ, I didn't recognize it like I should have. Those are being renamed for better discoverability and comprehension. And because they will use the exact same names as the more-generic LINQ extensions in the .NET libraries, given the way resolution of overloaded methods works, my specialized, and faster, methods will be used when appropriate, without you needing to do anything.
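The overload-resolution behaviour described here can be sketched with a string-specific `Any`. The types below (`TextLinqSketch`, `TextLinqDemo`) and the call counter are hypothetical and exist only to make the chosen overload observable; they are not how Stringier is written.

```csharp
using System;
using System.Linq;

public static class TextLinqSketch
{
    // Counts calls so it is observable which overload was picked.
    public static int SpecializedCalls;

    // Same name and shape as Enumerable.Any<T>(IEnumerable<T>, Func<T, bool>),
    // but declared on String, so it is the better match for string receivers.
    public static bool Any(this string source, Func<char, bool> predicate)
    {
        SpecializedCalls++;
        // Plain index loop: no enumerator object is needed.
        for (int i = 0; i < source.Length; i++)
        {
            if (predicate(source[i])) return true;
        }
        return false;
    }
}

public static class TextLinqDemo
{
    // Ordinary fluent call; the caller does nothing special to opt in.
    public static bool HasDigit(string text) => text.Any(c => char.IsDigit(c));
}
```

Because the receiver is a `string`, the compiler prefers the overload that takes `string` directly over the one that needs a conversion to `IEnumerable<char>`, so `TextLinqDemo.HasDigit` ends up in the specialized method even with `System.Linq` imported.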

That's not everything I've been noticing, but it's a lot, and it covers all of the major things. There's a lot going into this, and it really should be considered a major rewrite. Granted, this is all "loose" functions, so these aren't sweeping architectural changes, but Stringier does so much code sharing in order to be efficient and maintainable that changing anything results in cascades of further changes, especially when new types are introduced to replace old ones. But hey, that's what a thorough unit testing library is for, right? You are unit testing, right?
