DEV Community

DINH HOANG MY LE
DINH HOANG MY LE

Posted on

What I’m Learning in Data Structures: The Algorithm Behind Compression (bzip, etc.)

How Compression Algorithms Actually Work

Today we started learning how real compression algorithms work — from simple encoding to efficient machine-level implementations.

Compression starts with encoding

At first, we looked at a simple idea:

If we have an array like:

[a, b, c]

Enter fullscreen mode Exit fullscreen mode

We can encode each character using its index

So the word bba becomes:

→ 110

Enter fullscreen mode Exit fullscreen mode

Smarter Idea: Frequency Matters

In English:

  • Some letters appear more often (e, t, a…)
  • Some are rare (z, q…)

So we can assign shorter codes to frequent letters

That's the idea behind Huffman encoding. But Huffman encoding doesn't care about patterns.

For example:

"the man in the kitchen"

Enter fullscreen mode Exit fullscreen mode

The word "the" appears multiple times. Huffman treats each character independently. It doesn't care that the sequence "the" repeats.

Enter Move-to-Front (MTF)

There is an algorithm that works well when patterns are close together. It's called Move-to-front (MTF).

Imagine we have a list of all characters in order:

a → b → c → d → e → f → g → h → ... → x → y → z
0   1   2   3   4   5   6   7         23  24  25

Enter fullscreen mode Exit fullscreen mode

Now suppose our input string is:

"xyz yy"

Enter fullscreen mode Exit fullscreen mode

We process it one character at a time. For each character:

  1. Output the index of the character in the current list
  2. Move that character to the front of the list

Step by step:

First character: x

  • Index in list: 23
  • Output: 23
  • Move x to front → list now: x a b c d e f g h i j k l m n o p q r s t u v w y z ...

Next: y

  • Index: 24
  • Output: 24
  • Move y to front → list now: y x a b c d e f g h i j k l m n o p q r s t u v w z ...

Next: z

  • Index: 25
  • Output: 25
  • Move z to front → list now: z y x a b c d e f g h i j k l m n o p q r s t u v w ...

Next: space

  • Index: 26 (in the updated list)
  • Output: 26
  • Move space to front

Now the second y:

  • y → index 1 (because y is now at position 1 in the list)
  • Output: 1

Third y:

  • y → index 0 (because y just moved to the front!)
  • Output: 0

So the encoding for "xyz yy" becomes:

23 24 25 26 1 0

Enter fullscreen mode Exit fullscreen mode

Notice how the last two y's become 1 0 — much smaller numbers than the original 24! This is the power of MTF when similar characters appear close together.

The Missing Piece: Burrows-Wheeler Transform (BWT)

But there's a problem: MTF only works well if similar patterns are already close together in the input.

So now we need an algorithm that puts the same patterns close together so that we can take advantage of MTF. In fact, we need a rule — a specific order to group characters — so that we can reverse it and get the original input.

This is where the Burrows-Wheeler Transform (BWT) comes in. It's a reversible transformation that rearranges the input text to group similar characters together, making it perfect for feeding into MTF.

How BWT Works

Let's use "banana" as an example. We add $ to mark the end: banana$

Step 1: Create all rotations

Rotate one character back and repeat for all characters:

banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana

Enter fullscreen mode Exit fullscreen mode

Step 2: Sort lexicographically

$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

Enter fullscreen mode Exit fullscreen mode

Step 3: Take the last column

The output is: annb$aa

Why BWT is Powerful for English Text

Here's where BWT really shines. Look at what happens when we have text with repeated patterns:

Before sorting:

... the dog ...
... the cat ...
... the duck ...
... the man ...
Enter fullscreen mode Exit fullscreen mode

After sorting lexicographically:

he cat ... t
he dog ... t
he duck ... t
he man ... t
Enter fullscreen mode Exit fullscreen mode

Notice something amazing? After sorting, all the rotations that start with "he" (from "the") group together!

And look at the last column — all the t's are grouped together:

t
t
t
t
Enter fullscreen mode Exit fullscreen mode

How to Reverse BWT (Decoding)

Now the clever part: how do we get back the original string? Let's use "cat" as an example.

After BWT encoding cat$, we get the output: tc$a

Observation 1: Last column + sorting = First column

We have the last column (the encoded output): tc$a

If we sort it, we get the first column:

Last column (encoded): t c $ a
First column (sorted): $ a c t
Enter fullscreen mode Exit fullscreen mode

So now we have:

Row First ... Last
0 $ ... t
1 a ... c
2 c ... $
3 t ... a

Observation 2: Last and First are consecutive in the original

Since each row is a rotation, the last character and first character in each row are consecutive in the original input.

For example, from the row at$c:

  • Last: c
  • First: a
  • This tells us: c comes before a...ca...

Observation 3: We can start with $ because we know where the string ends

The complete decoding process:

  1. Start with the row that has $ in the last column (row 2)
    • First character: c
    • Add c to decoded string → c
  2. Move to the row with c in the last column (row 1)
    • First character: a
    • Add a to decoded string → ca
  3. Move to the row with a in the last column (row 3)
    • First character: t
    • Add t to decoded string → cat
  4. Move to the row with t in the last column (row 0)
    • First character: $
    • We've reached the end marker!

Putting It All Together: The Complete Compression Pipeline

Now we can see how real compression algorithms work by combining BWT and MTF.

The Compression Process

Let's use "banana" as our example:

Step 1: Apply BWT to preprocess and group similar characters

Original input: banana$

After BWT: annb$aa

Notice how the three a's are now grouped together at the end: ...aa, and the two n's are next to each other: nn...

Step 2: Apply Move-to-Front encoding

Now we encode annb$aa using MTF:

Starting list: $ a b c d e f g h i j k l m n o p q r s t u v w x y z

  • a → index 1, output: 1, move to front
  • n → index 14, output: 14, move to front
  • n → index 0 (just moved to front!), output: 0
  • b → index 3, output: 3, move to front
  • $ → index 4, output: 4, move to front
  • a → index 1, output: 1, move to front
  • a → index 0 (just moved to front!), output: 0

Final MTF output: 1 14 0 3 4 1 0

The magic: Look at those two consecutive n's becoming 14 0, and the two consecutive a's at the end becoming 1 0! The smaller numbers compress much better.

The Decompression Process

To get back the original:

  1. Reverse MTF: 1 14 0 3 4 1 0annb$aa
  2. Reverse BWT: annb$aabanana$

And we're back to the original string!

This is the foundation of real-world compression algorithms like bzip2, which uses exactly this BWT + MTF + Huffman combination to achieve excellent compression ratios on text data.

Top comments (0)