Stringier

#textprocessing #csharp #dotnet

Stringier was born from what I do most in programming: text processing. At its core is the subproject and library simply named Core, which holds a large number of methods related to text processing. We'll cover them all in detail as this blog continues; there will be a dedicated entry for each method. As I'm still working on a v4.0 release, I'm not ready to do so just yet. But I can take the opportunity to talk about the project overall, and all the different subprojects that make it up.

Stringier aims to solve a number of problems I've encountered over the 15 years I've been programming. Very few languages provide the kind of support for text processing that anyone working extensively in that area expects. To make matters worse, third-party libraries rarely provide efficient algorithms for what they do (more on that in later entries).

So how does this project work?

Firstly, there's extensive and automatically updated documentation here. Because the project is constantly updated, the documentation is itself constantly improving. There are some lacking areas and some unfinished pages, but you can always file a request for more.

At the absolute base, with almost no dependencies on anything else, are the type-specific projects. These include things like the backport of Rune and new types like Glyph.

Then there are the auxiliary projects that used to be a part of Core. These have been pulled out for various reasons related to maintainability, discoverability, updatability, and more. They include the likes of Encodings, a reimagining of Encoding and its related types, designed to be much easier to work with, that actually utilizes stream decoding, not strictly buffer decoding. The implications of that are hard to explain outside a dedicated blog post. Furthermore, it's fast and efficient. Run the benchmarks. Seriously, every single subproject that contains unique code developed by me has a benchmarking application utilizing BenchmarkDotNet. You don't need to take my word for it.

Other auxiliary projects include the likes of Metrics and Search, which you almost certainly don't want to use directly. Rather, they are implementations of all sorts of specific algorithms. If you know you need to use Rabin-Karp over Boyer-Moore-Horspool, or Levenshtein over Hamming, then these are what you'd be interested in, because they allow explicit selection of the algorithm. But if you don't know what any of that means, these algorithms are used behind much more friendly named methods elsewhere throughout the project.
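To make "Levenshtein over Hamming" a little more concrete, here is a minimal sketch of the two metrics in plain C#. The class and method names are hypothetical and this is not the Metrics API; it also compares UTF-16 code units rather than runes or glyphs, which is a simplifying assumption.

```csharp
using System;

// Hypothetical helper names for illustration; NOT the Stringier Metrics API.
public static class EditDistanceSketch
{
    // Hamming distance: counts the positions at which the characters differ.
    // Only defined for inputs of equal length.
    public static int Hamming(string a, string b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Hamming distance requires equal-length inputs.");

        int distance = 0;
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i]) distance++;
        }
        return distance;
    }

    // Levenshtein distance: the minimum number of single-character insertions,
    // deletions, and substitutions needed to turn a into b.
    // Uses two rolling rows instead of a full matrix.
    public static int Levenshtein(string a, string b)
    {
        int[] previous = new int[b.Length + 1];
        int[] current = new int[b.Length + 1];

        // Turning an empty prefix of a into the first j characters of b takes j insertions.
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            // Turning the first i characters of a into an empty string takes i deletions.
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int deletion = previous[j] + 1;
                int insertion = current[j - 1] + 1;
                current[j] = Math.Min(substitution, Math.Min(deletion, insertion));
            }
            // The row just computed becomes the previous row for the next iteration.
            (previous, current) = (current, previous);
        }
        return previous[b.Length];
    }
}
```

The choice matters: for "abcdef" and "bcdefa", Hamming reports 6 because every position differs, while Levenshtein reports 2 (delete the leading "a", insert it at the end). Hamming is cheaper but only meaningful when the inputs are aligned and of equal length.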

Patterns is a SNOBOL-inspired pattern matching engine, which takes additional influence from parser combinators, Regex, string scanners, and other parsing frameworks. It's designed to be as declarative as possible (and accomplishes that), allowing for self-optimization. Fundamentally, it's my own design, as it doesn't fit entirely into any of the above categories. But the implementation is FOSS and very liberally licensed, so feel free to dig into the internals. The implementation is extremely lightweight, and still a contender among the top-performing engines. It also houses a large-scale benchmark showdown between various engines that can be used from .NET, including the .NET Regex engine, PCRE, FParsec, Pidgin, Sprache, and more. So even if you're just curious about how these stack up against each other, this project can provide what you're interested in.

Rather unique to Patterns is the fact that it was designed to directly support debugging, and there's even a project to support that, called Tracing. Tracing provides an implementation of the trace collection interface used throughout the entire engine, and also hosts the debugger interfaces that I've been developing and using myself as I continue to use the engine. This engine is primarily meant for development of language parsers, such as for programming languages, but it has other uses. When it's finally adapted to take advantage of search algorithms, search/replace functionality will also be possible, making it a nearly feature-complete competitor to Regex engines.

Literary serves a drastically different purpose from the more computer/programming-oriented libraries. Like Core, it means to provide extensions, but focused very specifically on literary applications. Palindrome detection is a big example, as it's actually substantially more complicated than just reversing a string and doing a case-insensitive comparison. But it also includes things relevant to constrained writing or poetry, like detection of a heterogram, isogram, lipogram, and more. This library is also table-driven, so every algorithm will work for any language that has an entry. This means adding support for your language is as easy as adding in some information about your language. All the algorithms will then "just work". But not like the Todd Howard version; they actually work.
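As a rough illustration of why reverse-and-compare falls short, here is a minimal sketch in plain C#. The names are hypothetical and this is not the Literary implementation; it assumes that only letters and digits count and that invariant upper-casing is a good enough case fold, neither of which holds for every language (which is presumably where the tables come in).

```csharp
using System;

// Hypothetical helper names for illustration; NOT the Stringier Literary API.
public static class PalindromeSketch
{
    // Walks inward from both ends, skipping anything that is not a letter or digit,
    // and compares the remaining characters case-insensitively.
    // Works on UTF-16 code units, so combining marks and surrogate pairs
    // are not handled properly.
    public static bool IsPalindrome(string text)
    {
        int left = 0;
        int right = text.Length - 1;
        while (left < right)
        {
            if (!char.IsLetterOrDigit(text[left])) { left++; continue; }
            if (!char.IsLetterOrDigit(text[right])) { right--; continue; }
            if (char.ToUpperInvariant(text[left]) != char.ToUpperInvariant(text[right]))
                return false;
            left++;
            right--;
        }
        return true;
    }

    // The naive approach: reverse the string and compare, ignoring case.
    public static bool IsPalindromeNaive(string text)
    {
        char[] reversed = text.ToCharArray();
        Array.Reverse(reversed);
        return string.Equals(text, new string(reversed), StringComparison.OrdinalIgnoreCase);
    }
}
```

For "A man, a plan, a canal: Panama", the naive version returns false because the spaces and punctuation don't line up after reversal, while the first version returns true. Both agree on a simple case like "racecar".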

As you can tell, Stringier is a large project, with a scope far beyond what most text processing libraries dare tackle. But there's more in the works. Before that, however, let's talk about usage and contributions.

Using Stringier is straightforward. The majority of the libraries are BSD-3-Clause licensed, which is extremely liberal. The closed-source components are still free to use for both personal and commercial purposes; they just don't expose implementation details, because I've been noticing my work possibly showing up elsewhere, unattributed.

Contributing is, hopefully, straightforward as well. Contributions are welcome, and I'll work with you on getting them accepted. There's a lot to these libraries, and I'm glad for any help I can get. As part of the v4.0 audit and release process, I'll be adding a CONTRIBUTING.md to each repo, but the general gist is this: file an issue describing what changes you'd like; fork the repo; create a branch to do your work in; do your work; add unit tests for your work, and make sure they pass; add benchmarks for your work; file a pull request; party hard.

So what's still in the pipeline?

Well, as mentioned in previous blog posts, the character types that already exist still don't entirely get us where we'd like to be, so there are others that are going to be layered on top until that goal is eventually accomplished. Something for ligatures is next, and I'll see after that if there's even more to be done.

.NET Streams are, unlike much of .NET, an abomination. I'll cover that in a relevant blog post, but oh my god is the model problematic and unusable in serious applications. So, there's an entire reimagining of a Stream API as a part of Streams. It's entirely closed source, so you'll have to take my word on parts of this, but it's designed super well: it strictly adheres to SOLID design principles and addresses all of the concerns I have with the standard .NET Stream API. When that's done, and it's required for the v4.0 release, Patterns will be adapted to also be capable of stream parsing, another major advantage; the implementation already supports this, it just can't do it through the .NET Stream API because of those problems. Furthermore, additional APIs could potentially be built on top of Streams, such as specialized streams for certain content, like an HTML stream for reading/writing HTML, and so on. Unlike the .NET Stream API, it was designed entirely around working with text and not bytes. I'm not entirely sure where this goes, but I'll keep adding things as I actually need them, not just as my mind comes up with clever or neat ideas.

More practically, Stringier is being used to develop some commercial products, including, but not limited to, an extension for Word for literary purposes, and another project called Langly. Langly is a DSL for writing language parsers that uses the various utilities within Stringier for both its implementation and its runtime, but uniquely outputs .NET assemblies, which can easily be consumed and utilized by any CLS-compliant .NET language, C#, VB, and F# included.

DEV Community
