What Tree Data Structure Performs Well with rank() and select()? Self-Balancing Trees with Data Augmentation

DINH HOANG MY LE — Thu, 12 Feb 2026 16:58:58 +0000

When you need efficient rank and select operations on dynamic data, AVL trees (or other self-balancing trees) with data augmentation are your answer.

What are rank() and select()?

These are fundamental order statistics operations:

rank(x): Returns how many elements in the set are smaller than x (or equivalently, what position x would be if we sorted everything)
select(k): Returns the k-th smallest element in the set

Think of it like this: if you have {3, 7, 1, 9, 5}, then rank(7) = 3 (because three elements are smaller than 7), and select(3) = 5 (the 3rd smallest element is 5).

The Key Insight

The naive approach would be to store at each node x how many nodes are smaller than x. This would make rank(x) trivial - just find x and read that value in O(log n) time. But there's a fatal flaw: this direct count is a nightmare to maintain. Every insertion or deletion could cascade updates throughout the tree, destroying the efficiency we're after.

The Solution: Store Subtree Sizes Instead

Instead of storing "how many nodes are smaller than me," each node stores "how many nodes are in my subtree (including me)." This subtle shift changes everything.

Why does this work better? When you insert or delete a node, only the nodes on the path from that node to the root need their size updated - that's just O(log n) nodes in a balanced tree. The rest of the tree's size values remain valid.

With subtree sizes, you can compute rank(x) by summing up the sizes of left subtrees as you traverse down to x. Similarly, select(k) works by navigating based on comparing k with subtree sizes at each step. Both operations remain O(log n).

This is data augmentation at its finest: storing additional information that's cheap to maintain but powerful to query.

What I’m Learning in Data Structures: The Algorithm Behind Compression (bzip, etc.)

DINH HOANG MY LE — Wed, 11 Feb 2026 23:29:18 +0000

How Compression Algorithms Actually Work

Today we started learning how real compression algorithms work — from simple encoding to efficient machine-level implementations.

Compression starts with encoding

At first, we looked at a simple idea:

If we have an array like:

[a, b, c]

We can encode each character using its index

So the word bba becomes:

→ 110

Smarter Idea: Frequency Matters

In English:

Some letters appear more often (e, t, a…)
Some are rare (z, q…)

So we can assign shorter codes to frequent letters

That's the idea behind Huffman encoding. But Huffman encoding doesn't care about patterns.

For example:

"the man in the kitchen"

The word "the" appears multiple times. Huffman treats each character independently. It doesn't care that the sequence "the" repeats.

Enter Move-to-Front (MTF)

There is an algorithm that works well when patterns are close together. It's called Move-to-front (MTF).

Imagine we have a list of all characters in order:

a → b → c → d → e → f → g → h → ... → x → y → z
0   1   2   3   4   5   6   7         23  24  25

Now suppose our input string is:

"xyz yy"

We process it one character at a time. For each character:

Output the index of the character in the current list
Move that character to the front of the list

Step by step:

First character: x

Index in list: 23
Output: 23
Move x to front → list now: x a b c d e f g h i j k l m n o p q r s t u v w y z ...

Next: y

Index: 24
Output: 24
Move y to front → list now: y x a b c d e f g h i j k l m n o p q r s t u v w z ...

Next: z

Index: 25
Output: 25
Move z to front → list now: z y x a b c d e f g h i j k l m n o p q r s t u v w ...

Next: space

Index: 26 (in the updated list)
Output: 26
Move space to front

Now the second y:

y → index 1 (because y is now at position 1 in the list)
Output: 1

Third y:

y → index 0 (because y just moved to the front!)
Output: 0

So the encoding for "xyz yy" becomes:

23 24 25 26 1 0

Notice how the last two y's become 1 0 — much smaller numbers than the original 24! This is the power of MTF when similar characters appear close together.

The Missing Piece: Burrows-Wheeler Transform (BWT)

But there's a problem: MTF only works well if similar patterns are already close together in the input.

So now we need an algorithm that puts the same patterns close together so that we can take advantage of MTF. In fact, we need a rule — a specific order to group characters — so that we can reverse it and get the original input.

This is where the Burrows-Wheeler Transform (BWT) comes in. It's a reversible transformation that rearranges the input text to group similar characters together, making it perfect for feeding into MTF.

How BWT Works

Let's use "banana" as an example. We add $ to mark the end: banana$

Step 1: Create all rotations

Rotate one character back and repeat for all characters:

banana$
anana$b
nana$ba
ana$ban
na$bana
a$banan
$banana

Step 2: Sort lexicographically

$banana
a$banan
ana$ban
anana$b
banana$
na$bana
nana$ba

Step 3: Take the last column

The output is: annb$aa

Why BWT is Powerful for English Text

Here's where BWT really shines. Look at what happens when we have text with repeated patterns:

Before sorting:

... the dog ...
... the cat ...
... the duck ...
... the man ...

After sorting lexicographically:

he cat ... t
he dog ... t
he duck ... t
he man ... t

Notice something amazing? After sorting, all the rotations that start with "he" (from "the") group together!

And look at the last column — all the t's are grouped together:

t
t
t
t

How to Reverse BWT (Decoding)

Now the clever part: how do we get back the original string? Let's use "cat" as an example.

After BWT encoding cat$, we get the output: tc$a

Observation 1: Last column + sorting = First column

We have the last column (the encoded output): tc$a

If we sort it, we get the first column:

Last column (encoded): t c $ a
First column (sorted): $ a c t

So now we have:

Row	First	...	Last
0	$	...	t
1	a	...	c
2	c	...	$
3	t	...	a

Observation 2: Last and First are consecutive in the original

Since each row is a rotation, the last character and first character in each row are consecutive in the original input.

For example, from the row at$c:

Last: c
First: a
This tells us: c comes before a → ...ca...

Observation 3: We can start with `$` because we know where the string ends

The complete decoding process:

Start with the row that has $ in the last column (row 2)
- First character: c
- Add c to decoded string → c
Move to the row with c in the last column (row 1)
- First character: a
- Add a to decoded string → ca
Move to the row with a in the last column (row 3)
- First character: t
- Add t to decoded string → cat
Move to the row with t in the last column (row 0)
- First character: $
- We've reached the end marker!

Putting It All Together: The Complete Compression Pipeline

Now we can see how real compression algorithms work by combining BWT and MTF.

The Compression Process

Let's use "banana" as our example:

Step 1: Apply BWT to preprocess and group similar characters

Original input: banana$

After BWT: annb$aa

Notice how the three a's are now grouped together at the end: ...aa, and the two n's are next to each other: nn...

Step 2: Apply Move-to-Front encoding

Now we encode annb$aa using MTF:

Starting list: $ a b c d e f g h i j k l m n o p q r s t u v w x y z

a → index 1, output: 1, move to front
n → index 14, output: 14, move to front
n → index 0 (just moved to front!), output: 0
b → index 3, output: 3, move to front
$ → index 4, output: 4, move to front
a → index 1, output: 1, move to front
a → index 0 (just moved to front!), output: 0

Final MTF output: 1 14 0 3 4 1 0

The magic: Look at those two consecutive n's becoming 14 0, and the two consecutive a's at the end becoming 1 0! The smaller numbers compress much better.

The Decompression Process

To get back the original:

Reverse MTF: 1 14 0 3 4 1 0 → annb$aa
Reverse BWT: annb$aa → banana$

And we're back to the original string!

This is the foundation of real-world compression algorithms like bzip2, which uses exactly this BWT + MTF + Huffman combination to achieve excellent compression ratios on text data.

DEV Community: DINH HOANG MY LE