loading...

Text Manipulation is Complicated

entomy profile image Patrick Kelly ・5 min read

This is an adaptation of the article that used to be on my personal blog that I neglected. If you've read this over there, there is new information in this article.

2020-02-02. 02/02/2020. 33 days into the year. 333 days left in the year. Today is Palindrome Day. A day that will never happen again. So let's talk about palindromes.

I've focused on text manipulation and parsing. So it should come as no surprise that I've noticed a lot of misinformation about how to deal with text. There's a lot of bad algorithms out there. Good intentioned, but badly implemented.

What's a Palindrome?

This might seem obvious, right? From what I've read on StackOverflow, programmers certainly think it's obvious. According to most programmers:

A palindrome is a word which can be read the same way in reverse.

So that's simple, just reverse the string and compare it, right?

Trump "Wrong" meme

Let's look at the literary definition of palindrome. That's where this concept comes from, after all.

Merriam-Webster:

a word, verse, or sentence (such as "Able was I ere I saw Elba") or a number (such as 1881) that reads the same backward or forward

Dictionary.com:

a word, line, verse, number, sentence, etc., reading the same backward as forward, as Madam, I'm Adam or Poor Dan is in a droop.

Also consider looking at resources like palindromelist.net for examples. You should immediately notice that a palindrome isn't determined by whether the letters (or numbers) are the same in reverse, but rather, that it is read the same way in reverse.

What's everyone been doing?

Well, only detecting a subset of palindromes: palindromic words. This is one of the smallest subsets of palindromes however.

Why is this the case? Why are all the answers on Stack Overflow and other resources only doing this?

"I don't know" meme

Let's do this right

So let's actually try tackling this to the best that is possible given the current state of text processing capabilities in programming languages and runtimes. You probably started off with something like this:

public static Boolean IsPalindrome(this String @string) {
    return String.Equals(@string, @string.Reverse().Join());
}

This is the most common thing I've seen. Some people tackle casing as well, and might have this:

public static Boolean IsPalindrome(this String @string) {
    return String.Equals(@string.ToLower(), @string.ToLower().Reverse().Join());
}   

Nope. No, no no. No.

The intentions here are good. But don't use ToLower() for case insensitivity. For most of the world this works fine, but for a few select languages, such as German and Turkish, among a few others, you will run into problems. If you can, use a cultural insensitivity flag, otherwise use ToUpper().

public static Boolean IsPalindrome(this String @string) {
    return String.Equals(@string, @string.Reverse().Join(), StringComparison.CurrentcultureIgnoreCase);
}   

That's better. It won't break for specific languages. But it doesn't handle palindromic sentences just yet. The cleanest way I've found for handling this has been to build up a new string, while removing the spaces and punctuation.

There's a faster approach, but it's harder to maintain and reason about. It involves iterating forward and back, keeping two cursors and individually manipulating them to skip over unrelated chars; it's not as easy to get right. But it doesn't even matter because it's still not correct. A function that does not return the correct result is infinitely slow; never arriving at the right answer.

So what's the right way look right? Honestly I still don't have this perfect, but let's cover what I have, and the test cases run against it.

public static Boolean IsPalindrome(this ReadOnlySpan<Char> text) {
    ReadOnlySpan<Char> prepped = PalindromeStrip(text);
    String reversed = Glyph.Reverse(prepped);
    SpanGlyphEnumerator prep = prepped.EnumerateGlyphs();
    StringGlyphEnumerator revr = reversed.EnumerateGlyphs();
    // Now actually check it's a palindrome
    while (prep.MoveNext() && revr.MoveNext()) {
        if (prep.Current.ToUpper() != revr.Current.ToUpper()) {
            return false;
        }
    }
    return true;
}

private static ReadOnlySpan<Char> PalindromeStrip(ReadOnlySpan<Char> text) {
    // First we need to build the string without any punctuation or whitespace or any other unrelated-to-reading characters
    Char[] builder = new Char[text.Length];
    Int32 b = 0;
    Span<Char> glyphChars = new Char[4];
    foreach (Glyph s in text.EnumerateGlyphs()) {
        if (new Letter().Contains(s)) {
            Int32 l = s.EncodeToUtf16(glyphChars);
            while (l > 0) {
                builder[b++] = glyphChars[--l];
            }
        }
    }
    return builder.AsSpan().Slice(0, b);
}

Oh my god you're overcomplicating this, this code is horrendous.

Hol' up. Let's look at the test cases.

[Theory]
[InlineData("", true)]
[InlineData("a", true)]
[InlineData("detartrated", true)]
[InlineData("tattarrattat", true)]
[InlineData("Malayalam", true)]
[InlineData("Was it a car or a cat I saw?", true)]
[InlineData("No 'X' in Nixon", true)]
[InlineData("Able was I ere I saw Elba", true)]
[InlineData("A man, a plan, a canal, Panama!", true)]
[InlineData("Do, O God, no evil deed! Live on! Do good!", true)]
[InlineData("boot", false)]
[InlineData("Baêab", true)]
[InlineData("Bâeâb", true)]
public void IsPalindrome(String source, Boolean expected) => Assert.Equal(expected, source.IsPalindrome());

Think about the original implementations I showed off that nearly every programming is out there doing? How many of these would pass? Most fail. I'm not going to call my implementation perfect: I'm sure there's cases where it fails but shouldn't. In fact, I don't handle cultural/orthographic specific things yet. And there's more cases where my implementation falls short.

Going Forward

It's still clear that identifying palindromes as they are understood in literary studies is a lot more complicated than programmers have been letting on. In fact, all of text processing is plagued with this problem.

There's an enormous number of text algorithms, whether manipulation, processing, or even parsing, which have trouble with this. Graphemes can be achieved through different code units or scalar values. Yet most algorithms work on the code unit or scalar value level. This is unintentional, and only happens to work in a subset of cases; unfortunately, the most common language people are going to test with: English.

Luckily, the work I've been doing on Stringier is designed to make these problems immensely easier to tackle. As I implement more and more concepts, even my own algorithms have been greatly simplified and made more robust.

Text manipulation and processing is complicated.

Discussion

pic
Editor guide