For a long time I've been interested in learning about information theory. I even have books on the subject that I've tried to read, but as soon as they step out of the analogies and into the maths I get lost.
Never-the-less I have managed to understand one small part of it. The one thing I got my head around is that because standard ways of storing certain types of information (e.g. ASCII character data) have overhead (e.g. every ASCII character takes up 1 byte) and if your document doesn't need all of the overhead (e.g. you only use the 'a' to 'z' ASCII characters), you can remap the information to have a smaller footprint (e.g. each character can take up ¼ of a byte).
In order to prove to myself that I understand this idea I have written a little script to compress text files in python 3:
A exploratory look at how to compress files.
(Requires python 3)
Currently there is only text file compression available, via tc.py
usage: tc.py [-h] [-d] file name positional arguments: file name The name of the file to be processed. optional arguments: -h, --help show this help message and exit -d Decompress the file (defaults to compressing).
It is certainly not as good as ¼ compression. In fact the compression is barely noticeable, but it is there. Let's take a look at how it works (
First we have to figure out the unique characters in the document:
with open(self.filename, 'r') as textfile: text = textfile.read() chars = ''.join(set(text))
set data type makes this very easy. We turn the set back into a string because we need to preserve the order of the characters.
From this we can figure out how many bits we need (the bit length) to represent each character:
bl = len(chars).bit_length()
After this I fairly arbitrarily decided to compresses the characters into a list of ~64 bit numbers. In hindsight, python would allow me to fit all the characters into an integer of arbitrary size, so I didn't need to make a list, but that's what I've done:
max_inserts = 64//bl; current_insert = 0 value = '' values =  for c in text: n = chars.index(c) + 1 b = (bin(n)[2:]).zfill(bl) value += b current_insert += 1 if current_insert == max_inserts: values.append(int(value, 2)) current_insert = 0 value = '' if current_insert < max_inserts and current_insert > 0: values.append(int(value, 2))
That's a dense chunk of code, let's break it down. So first we set some initial values.
max_inserts tells us how many characters we can fit in a 64 bit integer.
current_insert is how many characters we've inserted into the integer.
value is the integer, represented as a string of 1s and 0s.
values is the list of integers.
Next we go through the text of the file one character at a time. First we get the index of the character in the list of unique characters. Then we turn that number into its binary representation (zero filling to the bit length).
The binary representations all need to be the same length so that we can extract the data later on from the file. If the values were different length there'd need to be a way to separate them, which would pad the values anyway, or there would need to be some complex way to put the values together, and I couldn't think of one. In fact, if you look through the git log of the repo, you'll see I tried different things until I got here.
Once the index is in binary, we append it to the current integer. Then we check to see if the current integer is long enough. If it is we add it to the list of integers and reset for the next integer. Right at the end we add the last integer we were working on in case we missed it.
That's it. Decompression is basically the same in reverse.
I know there is a lot more to compression than this, I just don't understand it (yet?). I do want to improve this to work with binary files. I don't think that should be too difficult. After that I think the next step should be to figure out how to compress files based on how often a pattern appears. The idea being that the character or group of bytes that appears most often should be assigned the smallest value, and so on, but I haven't figured out how to store that information in a compressed way.