Implementing Unix's compress

#python #algorithms #computerscience #programming

Let’s take a look at the compression algorithm behind Unix’s compress and most .gif compression.

The Lempel Ziv Welch [LZW] algorithm is a greedy lossless compression algorithm that works by replacing recurring patterns with shorter codes in order to save space.

Lossless is a form of compression where no data is lost.

We’ll go over the algorithm and take a look at an implementation in Python.

For simplicity sake, we’ll limit the discussion to encoding ASCII characters.

Encoding

Essentially, if as we’re processing a text file, we’re able to identify recurring phrases, we can substitute those phrases with a code that would be shorter than the original text. Doing this over the entire document gives us the compression we’re after.

LZW relies on a dictionary which maps a phrase to a unique code. The first 256 entries in the dictionary are the single ASCII characters mapped to their numeric code.

For example, at the start, the dictionary might look something like this:

{
  "A": "65",
  "B": "66",
  ....
  "a": "97",
  "b": "98"
}

Now, we’ll iterate through the rest of the text — character by character — building up a phrase / character sequence. If the inclusion of a new character creates a phrase that doesn’t currently exist in our dictionary, we will ignore this breaking character and output the code that matches the previous state of our character sequence.

Then, we’ll add this new phrase with this breaking character into our dictionary and assign it a new code.

For example, let’s say our input text is:

... app [cursor] appl ....

Our dictionary might look something like this:

{
  ...
  "app": 324,
  ....
}

The next portion of text we come across is “appl” which isn’t currently in our dictionary.

So, we’ll output the code for the existing phrase match of “app“ and then we’ll add “appl” to our dictionary with a new code.

The output after this iteration might look something like this:

Dictionary: { … “app”: 324, “appl”: 325 …}

Output: … 324 …

Notice that we only output a code when we encounter a new phrase that isn’t already in our dictionary. This gives us a chance to build up to larger phrases and achieve greater compression.

As the program processes the input and the dictionary grows, we’ll eventually be storing larger and larger strings. The longer the string, the more compression we can get by replacing it with a smaller code instead.

Unlike Huffman Encoding, LZW doesn’t require this dictionary to be included in the compressed data. One of the unique characteristics of this algorithm is that this dictionary can be rebuilt while uncompressing the data. Additionally, given the algorithm’s character by character approach, it’s well-suited for compressing streams of data.

To fully understand where the compression comes from, let’s take a look at a larger example:

34832 : 'Harry'
34833 : 'Potter'
34834 : 'Hogwarts'
34835 : 'magic'

If we save the code 34833 as an unsigned short, for example, it’ll cost 16 bits . However, it’ll let us represent 6 characters * 8 bits = 48 bits worth of information. Generally speaking, the higher the codes climb the more compression we will obtain.

Decoding

The decoding process is really just the opposite of the encoding process. The only change being that we need to re-create the dictionary with the 256 ASCII codes as a preliminary step before the rest of the decompression can begin.

Then, we read through the compressed data and use our code dictionary to translate the codes back to the original text.

Variable Length Encoding

Most implementations of this algorithm use the idea of a variable width code.

This defines an upper-bound for the maximum length of the code — thereby preventing this algorithm from running away with memory. For example, if we chose a 8-bit code as our width, our maximum number of codes would be 256 (2⁸).

If we chose a 12-bit code as our width, we could have up to 4096 codes and so on.

If our dictionary grew to include all 4096 codes, the rest of the document is encoded using the existing dictionary, but no new codes are added.

Code

The compression algorithm is fairly simple to implement.

The following implementation when run on A Tale of Two Cities, results in a compressed file that is 532 KB which is 32% smaller than the original at 777 KB.

The following implementation is an adapted version of this code:

Wrapping Up

This algorithm, though easy enough to understand and implement, isn’t always the right choice. It performs poorly when the text is short and unique; the algorithm favors redundant data.

If you like this technical dive into modern algorithms, feel free to follow me on Twitter, or check out my personal website for more posts.