# Compressing Radix Trees Without (Too Many) Tears

*Vaidehi Joshi · Originally published at Medium · 12 min read*

As it turns out, there comes a turning point in researching computer science when you think to yourself, *"Oh wow, no one really knows how to explain this."* (And then of course, if you're me, you think to yourself: *"Oh no, how am I going to explain this?!"*.)

I have some good news and some bad news.

Bad news first: I finally hit that turning point. This week's topic is one that I felt *should* have been easy enough to research and learn about; however, it was so much trickier than I thought! In fact, if there is one topic that has brought me to the brink of tears in this series, it is, without a doubt, the one that I'm going to share with you today. Here's the good news though: I powered through, and after an uphill battle (which mostly consisted of me furiously Googling for answers), I finally started to understand not just how this concept works, but also just how cool it is!

Today's topic is all about *compressing* things. For example, I'm going to attempt to compress a deeply complicated topic, which people have spent their entire careers developing and understanding, down into one single post. So, what else are we compressing? Time to find out.

### Hardcore (trie) compression

Last week, we learned about a powerful structure called a *trie*, which powers features that we interact with every day (such as autocomplete), and is used for implementing matching algorithms, and sometimes even radix sort. We know that a trie is a tree-like data structure whose nodes store the letters of an alphabet, and that words and strings can be retrieved from the structure by traversing down a branch path of the tree.

We also learned that tries are great examples of tradeoffs. They make it easy to retrieve words, but they also take up a lot of memory and space with empty (null) pointers. Last week, we pretty much accepted this as a fact: tries are always going to take up a lot of memory and space as they grow in size. But what if there was a way for us to cut down on some of the space and memory that a standard trie uses?

Well… surprise! It turns out there *is* a way. And we're finally ready to learn about it.

In order for us to understand *how* to cut down on the amount of space that our trie uses, we need to be able to easily see what exactly is responsible for taking up space.

Imagine that we have a standard trie that represents a group of keys: ["deck", "did", "doe", "dog", "doge", "dogs"]. Remember that it's not really an array of strings that we're dealing with here; each of these strings has a value, so it's more like a hash/dictionary, where each key has a value:

```
{
"deck": someValue,
"did": someValue,
"doe": someValue,
"dog": someValue,
"doge": someValue,
"dogs": someValue
}
```

In the illustration shown below, we've transformed those keys into a standard trie.
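As a rough sketch of the same idea in code (Python is assumed here; the original post uses illustrations rather than code), a standard trie over these keys might be built like this, with each node carrying a 26-slot array of child references plus an optional value:

```python
class TrieNode:
    def __init__(self):
        self.children = [None] * 26  # one slot per lowercase letter
        self.value = None            # set only where a key terminates

def insert(root, key, value):
    """Walk down one character at a time, creating nodes as needed."""
    node = root
    for ch in key:
        i = ord(ch) - ord('a')
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.value = value

root = TrieNode()
for key in ["deck", "did", "doe", "dog", "doge", "dogs"]:
    insert(root, key, "someValue")
```

Note that every node allocates all 26 child slots up front, even when only one of them will ever be used; that waste is exactly what compression targets.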

The first thing that we'll notice when we look at this trie is that there are two keys for which we have **redundant nodes**, as well as a *redundant chain of edges*. A redundant node is one that takes up an undue amount of space because it only has one child node of its own. We'll see that for the key "deck", the node for the character "e" is redundant, because it only has a single child node, but we still have to initialize an entire node, with all of its pointers, as a result.

Similarly, the edges that connect the key "did" are redundant, as they connect redundant nodes that don't *really* all need to be initialized, since they each have only one child node of their own.

In some standard tries, like the one we're dealing with above, we often find that we're repeating ourselves quite a bit.

The redundancy of a standard trie comes from the fact that we are repeating ourselves by allocating space for nodes or edges that contain only one possible string or word. Another way to think about it is that we repeat ourselves by allocating a lot of space for something that only has one possible branch path.

This is where compression comes in. We can compress our trie so that we neither repeat ourselves, nor use up more space than is necessary. A **compressed trie** is one that has been compacted down to save space.

The single most important rule when it comes to knowing *what* we can compress comes down to this: *each internal node (every parent node) must have two or more child nodes*. If a parent has two child nodes, which is to say two branch paths to potential leaf nodes, then it doesn't need to be compressed, since we actually need to allocate space and memory for both of these branch paths.

However, if a parent node only has one child node (that is to say, if it only has one possible branch path to a leaf node), then it can be compressed. In order to do the work of "compacting" the trie, each node that is the only child of its parent node is merged into its parent. The parent node and the single-child node are fused together, as are the values that they contain.

We can start compacting our standard trie example from earlier and compress the "single branch paths" like this:

We'll notice that we've compacted the nodes that formerly contained "e", "c", "k" so that the node "d" has a reference to "eck" as a single string contained in one node, rather than three. It's a similar story with the key "did", which has been compressed. The internal parent nodes that haven't been compressed have more than one branch path; however, those internal nodes with single child nodes have been compacted down, and we've reduced the number of nodes we previously needed.
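The merge rule can be sketched in Python (this is a hypothetical illustration, not the post's implementation; for simplicity each node here uses a dict of edge label → child instead of a 26-slot array):

```python
def make_node():
    return {"children": {}, "value": None}

def insert(root, key, value):
    node = root
    for ch in key:
        node = node["children"].setdefault(ch, make_node())
    node["value"] = value

def compress(node):
    """Fuse every chain of single-child, value-less nodes into one edge."""
    for label in list(node["children"]):
        child = node["children"][label]
        while len(child["children"]) == 1 and child["value"] is None:
            (sub_label, grandchild), = child["children"].items()
            del node["children"][label]
            label += sub_label                 # "e" + "c" + "k" -> "eck"
            node["children"][label] = grandchild
            child = grandchild
        compress(child)
    return node

root = make_node()
for k in ["deck", "did", "doe", "dog", "doge", "dogs"]:
    insert(root, k, "someValue")
compress(root)
```

After compression, the node "d" has edges "eck", "id", and "o" instead of single-character chains, matching the compacted trie described above.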

### Radix trees

The idea of "trie compression" is a concept that is well-known enough to warrant its own name. Compressed tries are also known as **radix trees**, **radix tries**, or **compact prefix trees**. (I know, I know, first it was "tries", now we're back to "trees"; I don't know why no one can agree on what to call anything!)

All of these terms really just refer to one thing: *a space-optimized version of a standard trie*. Unlike regular tries, the references/edges/pointers of a radix tree can hold a sequence of characters, and not just a single character element.

It's important to point out that, even though I have been illustrating tries as though the string characters of a word live in the word, we should remind ourselves quickly how tries actually work. Remember that a single node in a trie contains an array with references, and a value.

The illustration above is a far more accurate representation of what's going on under the hood of a radix tree. Notice how both a single-character value and a string value are contained in the references to other nodes. When the string terminates, the last reference points to an empty value/null.

Since we already know how to compress a standard trie, we should probably figure out just how useful the work of compressing a trie actually is. In order to understand how a radix tree is more useful in terms of saving memory, we can compare our two tries: the standard trie we started off with, and the radix tree that we converted it into.

We can see that in our standard trie, we needed 11 nodes in order to represent the group of keys: ["deck", "did", "doe", "dog", "doge", "dogs"].

However, when we compressed our standard trie down into a radix tree, such that every single-child node was merged with its parent, we only needed 8 nodes to represent that same group of keys. Recall that every time we initialize a node, we also need to initialize an array with (usually) 26 keys, one for each of the 26 letters of the English alphabet. By eliminating 3 nodes in the process of compression, we saved all of the space and memory that we would have needed to allocate otherwise.
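A quick back-of-the-envelope calculation makes the savings concrete (the 8-byte pointer size here is an assumption for illustration, typical of a 64-bit machine):

```python
# Each trie node carries a 26-slot child array.
slots_per_node = 26
bytes_per_slot = 8   # assumed: one 64-bit pointer per slot

standard_nodes = 11  # nodes in the standard trie
radix_nodes = 8      # nodes after compression
saved_nodes = standard_nodes - radix_nodes

saved_bytes = saved_nodes * slots_per_node * bytes_per_slot
print(saved_bytes)   # 3 nodes * 26 slots * 8 bytes = 624
```

Even in this tiny six-key example, compression avoids allocating hundreds of bytes of (mostly null) pointers; the savings grow with the size of the trie.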

We can also see the power of compression in radix trees when it comes to accessing and retrieving keys. In the example shown here, we've slimmed down our radix tree so that we're just working with three keys now: "dog", "doge", and "dogs". Each of these three keys has a value, so the structure that we're *really* trying to represent here looks like this:

```
{
"dog": 5,
"doge": 9,
"dogs": 12
}
```

Since we now know how to write and represent these key-value pairs in radix tree format, we can draw out what that radix tree would look like.

We have an empty string as the value of our root node, with a reference to the string "dog", which points to a single parent node, with a value of 5.

When we had written this as a standard trie earlier, we needed to access 5 different nodes (including the root node) in order to retrieve the value at the node ending in "g", which is the last character of "dog".

But, in a radix tree, since "dog" is a shared prefix, we can condense it down into a single root node, which eliminates some of the redundant nodes that we had before. Now, we only need to access 3 nodes in order to access the value for the key "dog". Remember that even though we have a single node at the reference point for "dog", we still need an additional terminating node with a null reference in order to store the actual value for this key.
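Retrieval from this compressed tree can be sketched like so (a hypothetical illustration in Python, with edge labels holding whole substrings; for brevity this sketch stores each value directly on the node rather than in a separate terminating node, as the post's illustrations do):

```python
def make_node(value=None):
    return {"children": {}, "value": value}

# Hand-build the compressed {"dog": 5, "doge": 9, "dogs": 12} tree.
root = make_node()
dog = make_node(5)
root["children"]["dog"] = dog        # shared prefix as a single edge
dog["children"]["e"] = make_node(9)
dog["children"]["s"] = make_node(12)

def lookup(node, key):
    """Descend by matching each edge label as a prefix of the remaining key."""
    while key:
        for label, child in node["children"].items():
            if key.startswith(label):
                node, key = child, key[len(label):]
                break
        else:
            return None  # no edge matches the rest of the key
    return node["value"]

print(lookup(root, "dog"))   # 5
print(lookup(root, "doge"))  # 9
```

Because "dog" is consumed by a single edge, the lookup for any of these keys follows far fewer references than the character-by-character walk in a standard trie.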

The power of radix trees comes from the fact that we compress down substrings based on their prefixes. Even from just this small example, we can see how the substring "dog" here can save two nodes and their corresponding space by compressing the trie into its radix tree format.

### Back to bit and byte basics

We've answered many questions about radix trees, but there's one thing that still remains a mystery: what does the *radix* have to do with the tree? In order to answer this question, we need to get back to the basics a little bit.

Some of the very first core concepts that we learned about were bits, bytes, and building with binary. We've run into binary a few times over the course of this series, and guess what? It's back again!

In order to understand the radix of a tree, we must understand how tries are read by our machines. In radix trees, keys are read in bits, or binary digits. They are compared *r* bits at a time, where 2 to the power of *r* is the radix of the tree.

A trie's keys could be read and processed a byte at a time, half a byte at a time, or two bits at a time. However, there is one particular type of radix tree that processes keys in a really interesting way, called a **PATRICIA tree**.

The PATRICIA tree was created in 1968 by Donald R. Morrison, who coined the acronym based on an algorithm he created for retrieving information efficiently from tries; PATRICIA stands for "*Practical Algorithm To Retrieve Information Coded In Alphanumeric*".

The reason that PATRICIA trees are so interesting is the way that they process keys. And, as we've learned, the way that keys are processed is tied directly to the radix of a radix tree.

The most important thing to remember about a PATRICIA tree is that its radix is 2. Since we know that the way that keys are compared happens *r* bits at a time, where 2 to the power of *r* is the radix of the tree, we can use this math to figure out how a PATRICIA tree reads a key.

Since the radix of a PATRICIA tree is 2, we know that *r* must be equal to 1, since 2¹ = 2. Thus, a PATRICIA tree processes its keys **one bit at a time**.
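Reading a key one bit at a time can be sketched with a small helper (an illustrative Python function, not from the original post, which pulls the *i*-th bit out of a key's ASCII encoding):

```python
def bit_at(key, i):
    """Return the i-th bit (1-indexed) of key's ASCII encoding."""
    byte = key.encode("ascii")[(i - 1) // 8]   # which character's byte
    return (byte >> (7 - (i - 1) % 8)) & 1     # which bit within that byte

r = 1
radix = 2 ** r
print(radix)            # 2: one bit at a time means a radix of 2
print(bit_at("dog", 1)) # 0: 'd' is 01100100, so the first bit is 0
print(bit_at("dog", 2)) # 1: and the second bit is 1
```

A PATRICIA tree's comparisons boil down to repeated calls like `bit_at(key, i)`, branching left on 0 and right on 1.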

Because a PATRICIA tree reads its keys in streams of bits, comparing 1 bit at a time, the entire tree is built up to represent binary digits. If you remember learning about binary search trees, you'll remember that a binary tree can only have two children, with the value of the left node being less than the value of the right node. So, in a binary radix tree, the right node is always used to represent binary digits (or bits) of 1, and the left node is used to represent bits of 0.

Because of this, each node in a PATRICIA tree has a 2-way branch, making this particular type of radix tree a **binary radix tree**. This is more obvious with an example, so let's look at one now.

Let's say that we want to turn our original set of keys, ["dog", "doge", "dogs"], into a PATRICIA tree representation. Since a PATRICIA tree reads keys one bit at a time, we'll need to convert these strings down into binary so that we can look at them bit by bit.

```
dog: 01100100 01101111 01100111
doge: 01100100 01101111 01100111 01100101
dogs: 01100100 01101111 01100111 01110011
```
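These binary strings can be reproduced with a couple of lines of Python, formatting each ASCII character as an 8-bit group:

```python
def to_bits(word):
    """Render a word as space-separated 8-bit groups, one per character."""
    return " ".join(format(ord(ch), "08b") for ch in word)

print(to_bits("dog"))   # 01100100 01101111 01100111
print(to_bits("doge"))  # 01100100 01101111 01100111 01100101
print(to_bits("dogs"))  # 01100100 01101111 01100111 01110011
```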

Notice how the key "dog" is a binary prefix of both "doge" and "dogs": the binary representations of all three words are exactly the same up until the 25th digit. Interestingly, "doge" and "dogs" share an even longer prefix; the binary representations of these two words are the same up until the 28th digit!

Pretty cool, right!? Who knew that "doge" and "dogs" shared such a long binary prefix?

Okay, so since we know that "dog" is a prefix of "doge", we will compare them bit by bit. The point at which they diverge is at bit 25, where "doge" has a value of 0. Since we know that our binary radix tree can only have 0's and 1's, we just need to put "doge" in the correct place. Since it diverges with a value of 0, we'll add it as the left child node of our root node "dog".

Now we'll do the same thing with "dogs". Since "dogs" differs from "doge" at bit 28, we'll compare bit by bit up until that point.

At bit 28, "dogs" has a bit value of 1, while "doge" has a bit value of 0. So, we'll add "dogs" as the right child of "doge".
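The divergence points we just walked through can be computed with a small sketch (hypothetical Python, not Morrison's algorithm itself), which finds the first bit position at which two keys differ:

```python
def to_bits(word):
    """The word's ASCII encoding as one unbroken bit string."""
    return "".join(format(ord(ch), "08b") for ch in word)

def first_divergence(a, b):
    """First bit position (1-indexed) where the two keys differ."""
    bits_a, bits_b = to_bits(a), to_bits(b)
    for i, (x, y) in enumerate(zip(bits_a, bits_b), start=1):
        if x != y:
            return i
    # One key is a bit-prefix of the other: they diverge where it ends.
    return min(len(bits_a), len(bits_b)) + 1

print(first_divergence("dog", "doge"))   # 25: "doge" continues where "dog" ends
print(first_divergence("doge", "dogs"))  # 28
```

The bit value at the divergence point (0 or 1) is what decides whether the new key becomes a left or right child.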

This is, of course, a simplified version of Donald R. Morrison's algorithm, which is far more complex. But ultimately, that's what's happening at a very low level! His abstract on the *Practical Algorithm To Retrieve Information Coded in Alphanumeric* succinctly explains the basic idea behind PATRICIA and where it has been used:

> PATRICIA is an algorithm which provides a flexible means of storing, indexing, and retrieving information in a large file, which is economical of index space and of reindexing time. It retrieves information in response to keys furnished by the user with a quantity of computation which has a bound which depends linearly on the length of keys and the number of their proper occurrences and is otherwise independent of the size of the library. It has been implemented in several variations as FORTRAN programs for the CDC-3600, utilizing disk file storage of text. It has been applied to several large information-retrieval problems and will be applied to others.

Radix and PATRICIA trees can be difficult structures to understand because they involve abstract concepts, but leverage very low-level (quite *literally* binary) things.

To be totally honest, this was the first topic in this series where I truly struggled to understand a concept. But when I finally *did* comprehend how things were working, it was pretty cool to see how small building blocks are the very things that make up bigger constructs that power parts of our world.

But if this topic makes you want to cry, don't worry; it nearly brought me to tears, too.

### Resources

It's pretty tricky to find approachable resources on radix trees, but if you dig deep enough around the internet, they *do* exist. Luckily, you don't need to dig for them, because I already did! Here are some good places to start reading if you want to learn more about binary radix and PATRICIA trees.

- Compressed tries (Patricia tries), Professor Shun Yan Cheung
- What is the difference between radix trees and Patricia tries?, Computer Science Stack Exchange
- What is the difference between trie and radix trie data structures?, Stack Overflow
- Data Structures for Strings, Professor Pat Morin
- PATRICIA: Practical Algorithm To Retrieve Information Coded in Alphanumeric, Professor Donald R. Morrison
- Dictionary of Algorithms and Data Structures: Patricia tree, Stefan Edelkamp
- Tries, Patricia Tries, and Suffix Trees, Professor Yufei Tao

*This post was originally published on medium.com*
