How Compression Algorithms Actually Work
Today we started learning how real compression algorithms work — from simple encoding to efficient machine-level implementations.
Compression starts with encoding
At first, we looked at a simple idea:
If we have an array like:
[a, b, c]
We can encode each character using its index
So the word bba becomes:
→ 110
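This index-substitution idea is small enough to sketch in a few lines of Python (the alphabet and function name here are just illustrative):

```python
# Encode each character as its index in a fixed alphabet.
alphabet = ['a', 'b', 'c']

def encode(text):
    # b → 1, a → 0, so "bba" → "110"
    return ''.join(str(alphabet.index(ch)) for ch in text)

print(encode('bba'))  # → 110
```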
Smarter Idea: Frequency Matters
In English:
- Some letters appear more often (e, t, a…)
- Some are rare (z, q…)
So we can assign shorter codes to frequent letters and longer codes to rare ones.
That's the idea behind Huffman encoding. But Huffman encoding doesn't care about patterns.
For example:
"the man in the kitchen"
The word "the" appears multiple times. Huffman treats each character independently. It doesn't care that the sequence "the" repeats.
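A minimal Huffman sketch, assuming the standard heap-based tree construction (all names here are illustrative, not from any particular library):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries are (frequency, tiebreaker, node); the unique tiebreaker
    # keeps Python from ever comparing the node objects themselves.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        # Repeatedly merge the two least frequent subtrees.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (left, right)))
        counter += 1
    # Walk the tree: left edge = '0', right edge = '1'.
    codes = {}
    def walk(node, prefix):
        if isinstance(node, str):
            codes[node] = prefix or '0'
        else:
            walk(node[0], prefix + '0')
            walk(node[1], prefix + '1')
    walk(heap[0][2], '')
    return codes

codes = huffman_codes("the man in the kitchen")
# Frequent characters (like the space) get codes no longer than rare ones (like m).
```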
Enter Move-to-Front (MTF)
There is an algorithm that works well when patterns are close together. It's called Move-to-front (MTF).
Imagine we have a list of all characters in order:
a → b → c → d → e → f → g → h → ... → x → y → z
(indices: a = 0, b = 1, c = 2, d = 3, ..., x = 23, y = 24, z = 25)
Now suppose our input string is:
"xyz yy"
We process it one character at a time. For each character:
- Output the index of the character in the current list
- Move that character to the front of the list
Step by step:
First character: x
- Index in list: 23
- Output: 23
- Move x to front → list now: x a b c d e f g h i j k l m n o p q r s t u v w y z ...
Next: y
- Index: 24
- Output: 24
- Move y to front → list now: y x a b c d e f g h i j k l m n o p q r s t u v w z ...
Next: z
- Index: 25
- Output: 25
- Move z to front → list now: z y x a b c d e f g h i j k l m n o p q r s t u v w ...
Next: space
- Index: 26 (space sits at the end of our 27-symbol list, after z)
- Output: 26
- Move space to front
Now the second y:
- y → index 2 (space and z were both moved to the front after y was, so they now sit ahead of it)
- Output: 2
- Move y to front
Third y:
- y → index 0 (because y just moved to the front!)
- Output: 0
So the encoding for "xyz yy" becomes:
23 24 25 26 2 0
Notice how the last two y's become 2 and 0, much smaller numbers than the original 24! This is the power of MTF when similar characters appear close together.
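The walk-through above can be sketched as a short Python function; the 27-symbol alphabet with the space at the end is an assumption of this example:

```python
def mtf_encode(text, symbols):
    table = list(symbols)       # working copy; mutated as we encode
    output = []
    for ch in text:
        idx = table.index(ch)   # current position of the character
        output.append(idx)
        table.insert(0, table.pop(idx))  # move it to the front
    return output

# Assumption: a 27-symbol alphabet, a-z followed by the space character.
print(mtf_encode('xyz yy', 'abcdefghijklmnopqrstuvwxyz '))
# → [23, 24, 25, 26, 2, 0]
```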
The Missing Piece: Burrows-Wheeler Transform (BWT)
But there's a problem: MTF only works well if similar patterns are already close together in the input.
So now we need an algorithm that puts the same patterns close together so that we can take advantage of MTF. In fact, we need a rule — a specific order to group characters — so that we can reverse it and get the original input.
This is where the Burrows-Wheeler Transform (BWT) comes in. It's a reversible transformation that rearranges the input text to group similar characters together, making it perfect for feeding into MTF.
How BWT Works
Let's use "banana" as an example. We add $ to mark the end: banana$
Step 1: Create all rotations
Rotate the string by one character and repeat until we've produced every rotation:
banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana
Step 2: Sort lexicographically
$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba
Step 3: Take the last column
The output is: annb$aa
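The rotate-sort-take-last-column recipe is compact enough to sketch directly (the function name is illustrative):

```python
def bwt_encode(text):
    # Append the end marker, build all rotations, sort them,
    # and read off the last column.
    s = text + '$'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)

print(bwt_encode('banana'))  # → annb$aa
```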
Why BWT is Powerful for English Text
Here's where BWT really shines. Look at what happens when we have text with repeated patterns:
Before sorting:
... the dog ...
... the cat ...
... the duck ...
... the man ...
After sorting lexicographically:
he cat ... t
he dog ... t
he duck ... t
he man ... t
Notice something amazing? After sorting, all the rotations that start with "he" (from "the") group together!
And look at the last column — all the t's are grouped together:
t
t
t
t
How to Reverse BWT (Decoding)
Now the clever part: how do we get back the original string? Let's use "cat" as an example.
After BWT encoding cat$, we get the output: tc$a
Observation 1: Last column + sorting = First column
We have the last column (the encoded output): tc$a
If we sort it, we get the first column:
Last column (encoded): t c $ a
First column (sorted): $ a c t
So now we have:
| Row | First | ... | Last |
|---|---|---|---|
| 0 | $ | ... | t |
| 1 | a | ... | c |
| 2 | c | ... | $ |
| 3 | t | ... | a |
Observation 2: Last and First are consecutive in the original
Since each row is a rotation, the last character and first character in each row are consecutive in the original input.
For example, from the row at$c:
- Last: c
- First: a
- This tells us: c comes right before a in the (cyclic) original → ...ca...
Observation 3: We can start with $ because we know where the string ends
The complete decoding process:
1. Start with the row that has $ in the last column (row 2). Its first character is c. Add c to the decoded string → c
2. Move to the row with c in the last column (row 1). Its first character is a. Add a → ca
3. Move to the row with a in the last column (row 3). Its first character is t. Add t → cat
4. Move to the row with t in the last column (row 0). Its first character is $. We've reached the end marker!
(Every character in cat$ is distinct, which keeps this walk simple. With repeated characters, the k-th occurrence of a character in the last column corresponds to the k-th occurrence of that character in the first column.)
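The chain-following procedure can be sketched in Python. One assumption worth flagging: to handle repeated characters in general, this sketch pairs the k-th occurrence of each character in the last column with the k-th occurrence in the first column (the standard LF-mapping idea); for cat$, where every character is distinct, it reduces to the simple walk described above:

```python
def bwt_decode(last_column):
    # Tag each last-column character with its occurrence number,
    # e.g. the second 'a' becomes ('a', 2).
    ranked_last = []
    seen = {}
    for ch in last_column:
        seen[ch] = seen.get(ch, 0) + 1
        ranked_last.append((ch, seen[ch]))
    # Sorting the tagged last column gives the tagged first column.
    ranked_first = sorted(ranked_last)
    # Map each tagged character back to its row in the last column.
    row_of = {pair: row for row, pair in enumerate(ranked_last)}
    # Start at the row whose last character is '$' and follow the chain.
    row = last_column.index('$')
    out = []
    for _ in range(len(last_column) - 1):
        pair = ranked_first[row]   # first character of the current row
        out.append(pair[0])
        row = row_of[pair]         # jump to the row ending with that character
    return ''.join(out)

print(bwt_decode('tc$a'))  # → cat
```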
Putting It All Together: The Complete Compression Pipeline
Now we can see how real compression algorithms work by combining BWT and MTF.
The Compression Process
Let's use "banana" as our example:
Step 1: Apply BWT to preprocess and group similar characters
Original input: banana$
After BWT: annb$aa
Notice how two of the three a's are now grouped together at the end (...aa) and the two n's sit next to each other (nn...).
Step 2: Apply Move-to-Front encoding
Now we encode annb$aa using MTF:
Starting list: $ a b c d e f g h i j k l m n o p q r s t u v w x y z
- a → index 1, output: 1, move to front
- n → index 14, output: 14, move to front
- n → index 0 (just moved to front!), output: 0
- b → index 3, output: 3, move to front
- $ → index 3, output: 3, move to front
- a → index 3, output: 3, move to front
- a → index 0 (just moved to front!), output: 0
Final MTF output: 1 14 0 3 3 3 0
The magic: look at those two consecutive n's becoming 14 0, and the two consecutive a's at the end becoming 3 0! The smaller numbers compress much better.
The Decompression Process
To get back the original:
- Reverse MTF: 1 14 0 3 3 3 0 → annb$aa
- Reverse BWT: annb$aa → banana$
And we're back to the original string!
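Here is a minimal end-to-end sketch of the pipeline, combining the BWT and MTF pieces from earlier (all function names are illustrative):

```python
def bwt_encode(text):
    # Append the end marker, sort all rotations, take the last column.
    s = text + '$'
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return ''.join(rot[-1] for rot in rotations)

def bwt_decode(last):
    # Pair the k-th occurrence of each character in the last column
    # with the k-th occurrence in the (sorted) first column.
    ranked, seen = [], {}
    for ch in last:
        seen[ch] = seen.get(ch, 0) + 1
        ranked.append((ch, seen[ch]))
    first = sorted(ranked)
    row_of = {pair: row for row, pair in enumerate(ranked)}
    row, out = last.index('$'), []
    for _ in range(len(last) - 1):
        pair = first[row]
        out.append(pair[0])
        row = row_of[pair]
    return ''.join(out)

# Assumption: a 27-symbol alphabet with '$' first, matching the example above.
ALPHABET = '$abcdefghijklmnopqrstuvwxyz'

def mtf_encode(text):
    table, out = list(ALPHABET), []
    for ch in text:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def mtf_decode(indices):
    table, out = list(ALPHABET), []
    for idx in indices:
        ch = table[idx]
        out.append(ch)
        table.insert(0, table.pop(idx))
    return ''.join(out)

# Compression: BWT groups similar characters, MTF turns them into small numbers.
encoded = mtf_encode(bwt_encode('banana'))
print(encoded)  # → [1, 14, 0, 3, 3, 3, 0]
# Decompression reverses both steps.
assert bwt_decode(mtf_decode(encoded)) == 'banana'
```

A real compressor would then feed those small numbers into an entropy coder such as Huffman.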
This is the foundation of real-world compression algorithms like bzip2, which builds on this same BWT + MTF idea (together with run-length and Huffman coding) to achieve excellent compression ratios on text data.