How Compression Algorithms Actually Work
Today we started learning how real compression algorithms work — from simple encoding to efficient machine-level implementations.
Compression starts with encoding
At first, we looked at a simple idea:
If we have an array like:
[a, b, c]
We can encode each character using its index
So the word bba becomes:
→ 110
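This index-substitution idea is small enough to sketch in a few lines of Python (the alphabet and function name here are just illustrative):

```python
# Encode each character as its index in a fixed alphabet.
alphabet = ['a', 'b', 'c']

def encode(text):
    # b → 1, a → 0, so "bba" → "110"
    return ''.join(str(alphabet.index(ch)) for ch in text)

print(encode('bba'))  # → 110
```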
Smarter Idea: Frequency Matters
In English:
- Some letters appear more often (e, t, a…)
- Some are rare (z, q…)
So we can assign shorter codes to frequent letters and longer codes to rare ones.
That's the idea behind Huffman encoding. But Huffman encoding doesn't care about patterns.
For example:
"the man in the kitchen"
The word "the" appears multiple times. Huffman treats each character independently. It doesn't care that the sequence "the" repeats.
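A minimal Huffman sketch, assuming the standard heap-based tree construction (all names here are illustrative, not from any particular library):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries are (frequency, tiebreaker, node); the unique tiebreaker
    # keeps Python from ever comparing the node objects themselves.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    # Walk the tree: left edge = '0', right edge = '1'.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or '0'
        else:
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
    walk(heap[0][2], '')
    return codes

codes = huffman_codes("the man in the kitchen")
# Frequent characters (like the space) get codes no longer than rare ones (like m).
```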
Enter Move-to-Front (MTF)
There is an algorithm that works well when patterns are close together. It's called Move-to-front (MTF).
Imagine we have a list of all characters in order:
a → b → c → d → e → f → g → h → ... → x → y → z
(indices: a = 0, b = 1, c = 2, d = 3, ..., x = 23, y = 24, z = 25)
Now suppose our input string is:
"xyz yy"
We process it one character at a time. For each character:
- Output the index of the character in the current list
- Move that character to the front of the list
Step by step:
First character: x
- Index in list: 23
- Output: 23
- Move x to front → list now: x a b c d e f g h i j k l m n o p q r s t u v w y z ...
Next: y
- Index: 24
- Output: 24
- Move y to front → list now: y x a b c d e f g h i j k l m n o p q r s t u v w z ...
Next: z
- Index: 25
- Output: 25
- Move z to front → list now: z y x a b c d e f g h i j k l m n o p q r s t u v w ...
Next: space
- Index: 26 (space sits at the end of our 27-symbol list, after z)
- Output: 26
- Move space to front
Now the second y:
- y → index 2 (space and z were both moved to the front after y was, so they now sit ahead of it)
- Output: 2
- Move y to front
Third y:
- y → index 0 (because y just moved to the front!)
- Output: 0
So the encoding for "xyz yy" becomes:
23 24 25 26 2 0
Notice how the last two y's become 2 and 0, much smaller numbers than the original 24! This is the power of MTF when similar characters appear close together.
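The walk-through above can be sketched as a short Python function; the 27-symbol alphabet with the space at the end is an assumption of this example:

```python
def mtf_encode(text, symbols):
    table = list(symbols)       # working copy; mutated as we encode
    output = []
    for ch in text:
        idx = table.index(ch)   # current position of the character
        output.append(idx)
        table.insert(0, table.pop(idx))  # move it to the front
    return output

# Assumption: a 27-symbol alphabet, a-z followed by the space character.
print(mtf_encode('xyz yy', 'abcdefghijklmnopqrstuvwxyz '))
# → [23, 24, 25, 26, 2, 0]
```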
The Missing Piece: Burrows-Wheeler Transform (BWT)
But there's a problem: MTF only works well if similar patterns are already close together in the input.
So now we need an algorithm that puts the same patterns close together so that we can take advantage of MTF. In fact, we need a rule — a specific order to group characters — so that we can reverse it and get the original input.
This is where the Burrows-Wheeler Transform (BWT) comes in. It's a reversible transformation that rearranges the input text to group similar characters together, making it perfect for feeding into MTF.
How BWT Works
Let's use "banana" as an example. We add $ to mark the end: banana$
Step 1: Create all rotations
Rotate the string by one character and repeat until we've produced every rotation:
banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana
Step 2: Sort lexicographically
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
Step 3: Take the last column
The output is: annb$aa
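The rotate-sort-take-last-column recipe is compact enough to sketch directly (the function name is illustrative):

```python
def bwt_encode(text):
    # Append the end marker, build all rotations, sort them,
    # and read off the last column.
    s = text + '$'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)

print(bwt_encode('banana'))  # → annb$aa
```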
Why BWT is Powerful for English Text
Here's where BWT really shines. Look at what happens when we have text with repeated patterns:
Before sorting:
... the dog ...
... the cat ...
... the duck ...
... the man ...
After sorting lexicographically:
he cat ... t
he dog ... t
he duck ... t
he man ... t
Notice something amazing? After sorting, all the rotations that start with "he" (from "the") group together!
And look at the last column — all the t's are grouped together:
t
t
t
t
How to Reverse BWT (Decoding)
Now the clever part: how do we get back the original string? Let's use "cat" as an example.
After BWT encoding cat$, we get the output: tc$a
Observation 1: Last column + sorting = First column
We have the last column (the encoded output): tc$a
If we sort it, we get the first column:
Last column (encoded): t c $ a
First column (sorted): $ a c t
So now we have:
| Row | First | ... | Last |
|---|---|---|---|
| 0 | $ | ... | t |
| 1 | a | ... | c |
| 2 | c | ... | $ |
| 3 | t | ... | a |
Observation 2: Last and First are consecutive in the original
Since each row is a rotation, the last character and first character in each row are consecutive in the original input.
For example, from the row at$c:
- Last: c
- First: a
- This tells us: c comes right before a in the (cyclic) original → ...ca...
Observation 3: We can start with $ because we know where the string ends
The complete decoding process:
1. Start with the row that has $ in the last column (row 2). Its first character is c. Add c to the decoded string → c
2. Move to the row with c in the last column (row 1). Its first character is a. Add a → ca
3. Move to the row with a in the last column (row 3). Its first character is t. Add t → cat
4. Move to the row with t in the last column (row 0). Its first character is $. We've reached the end marker!
(Every character in cat$ is distinct, which keeps this walk simple. With repeated characters, the k-th occurrence of a character in the last column corresponds to the k-th occurrence of that character in the first column.)
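The chain-following procedure can be sketched in Python. One assumption worth flagging: to handle repeated characters in general, this sketch pairs the k-th occurrence of each character in the last column with the k-th occurrence in the first column (the standard LF-mapping idea); for cat$, where every character is distinct, it reduces to the simple walk described above:

```python
def bwt_decode(last_column):
    # Tag each last-column character with its occurrence number,
    # e.g. the second 'a' becomes ('a', 2).
    ranked_last = []
    seen = {}
    for ch in last_column:
        seen[ch] = seen.get(ch, 0) + 1
        ranked_last.append((ch, seen[ch]))
    # Sorting the tagged last column gives the tagged first column.
    ranked_first = sorted(ranked_last)
    # Map each tagged character back to its row in the last column.
    row_of = {pair: row for row, pair in enumerate(ranked_last)}
    # Start at the row whose last character is '$' and follow the chain.
    row = last_column.index('$')
    out = []
    for _ in range(len(last_column) - 1):
        pair = ranked_first[row]   # first character of the current row
        out.append(pair[0])
        row = row_of[pair]         # jump to the row ending with that character
    return ''.join(out)

print(bwt_decode('tc$a'))  # → cat
```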
Putting It All Together: The Complete Compression Pipeline
Now we can see how real compression algorithms work by combining BWT and MTF.
The Compression Process
Let's use "banana" as our example:
Step 1: Apply BWT to preprocess and group similar characters
Original input: banana$
After BWT: annb$aa
Notice how two of the three a's are now grouped together at the end (...aa) and the two n's sit next to each other (nn...).
Step 2: Apply Move-to-Front encoding
Now we encode annb$aa using MTF:
Starting list: $ a b c d e f g h i j k l m n o p q r s t u v w x y z
- a → index 1, output: 1, move to front
- n → index 14, output: 14, move to front
- n → index 0 (just moved to front!), output: 0
- b → index 3, output: 3, move to front
- $ → index 3, output: 3, move to front
- a → index 3, output: 3, move to front
- a → index 0 (just moved to front!), output: 0
Final MTF output: 1 14 0 3 3 3 0
The magic: look at those two consecutive n's becoming 14 0, and the two consecutive a's at the end becoming 3 0! The smaller numbers compress much better.
The Decompression Process
To get back the original:
- Reverse MTF: 1 14 0 3 3 3 0 → annb$aa
- Reverse BWT: annb$aa → banana$
And we're back to the original string!
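Here is a minimal end-to-end sketch of the pipeline, combining the BWT and MTF pieces from earlier (all function names are illustrative):

```python
def bwt_encode(text):
    # Append the end marker, sort all rotations, take the last column.
    s = text + '$'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)

def bwt_decode(last):
    # Pair the k-th occurrence of each character in the last column
    # with the k-th occurrence in the (sorted) first column.
    ranked, seen = [], {}
    for ch in last:
        seen[ch] = seen.get(ch, 0) + 1
        ranked.append((ch, seen[ch]))
    first = sorted(ranked)
    row_of = {pair: row for row, pair in enumerate(ranked)}
    row, out = last.index('$'), []
    for _ in range(len(last) - 1):
        pair = first[row]
        out.append(pair[0])
        row = row_of[pair]
    return ''.join(out)

# Assumption: a 27-symbol alphabet with '$' first, matching the example above.
ALPHABET = '$abcdefghijklmnopqrstuvwxyz'

def mtf_encode(text):
    table, out = list(ALPHABET), []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def mtf_decode(indices):
    table, out = list(ALPHABET), []
    for idx in indices:
        ch = table[idx]
        out.append(ch)
        table.insert(0, table.pop(idx))
    return ''.join(out)

# Compression: BWT groups similar characters, MTF turns them into small numbers.
encoded = mtf_encode(bwt_encode('banana'))
print(encoded)  # → [1, 14, 0, 3, 3, 3, 0]
# Decompression reverses both steps.
assert bwt_decode(mtf_decode(encoded)) == 'banana'
```

A real compressor would then feed those small numbers into an entropy coder such as Huffman.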
This is the foundation of real-world compression algorithms like bzip2, which builds on this same BWT + MTF idea (together with run-length and Huffman coding) to achieve excellent compression ratios on text data.