SEN LLC

Posted on Jun 21

Implementing Huffman Coding in Go — Merging the Two Rarest Nodes Builds the Optimal Prefix-Free Code

#go #algorithms #compression #cli

A CLI that compresses text with Huffman coding and reports the ratio, in Go. The essence of Huffman coding: give frequent symbols short bit strings, rare ones long ones, and make the code prefix-free — no code is a prefix of another — so a concatenated bit stream decodes unambiguously with no separators. The hinge: a greedy construction that just merges the two lowest-frequency nodes repeatedly produces the optimal code tree — and pinning that with property tests.

📦 GitHub: https://github.com/sen-ltd/huffman-cli

(It's a CLI — no live demo. Run with go run or Docker.)

What "prefix-free" buys you

Fixed-width codes (8-bit ASCII) decode without separators. Go variable-length and "where does one symbol end?" becomes the problem: with a=0, b=01, the stream 001 is ambiguous.

Prefix-free removes the ambiguity. Since no code is a prefix of another, you read the bit stream left to right, walk the code tree, and the moment you hit a leaf you've finished exactly one symbol. Unique decoding, zero separators.

The hinge: merge the two rarest nodes

The tree construction is a surprisingly simple greedy algorithm:

Make each symbol a leaf weighted by its frequency.
Put them all in a min-heap.
Pop the two lowest-frequency nodes, merge them under a parent (frequency = sum), push it back.
Repeat until one node remains.

for q.Len() > 1 {
    a := heap.Pop(q).(*node)   // lowest
    b := heap.Pop(q).(*node)   // next lowest
    heap.Push(q, &node{freq: a.freq + b.freq, left: a, right: b})
}

Left edge adds 0, right adds 1; each leaf's root-to-leaf path is its code:

var walk func(n *node, prefix string)
walk = func(n *node, prefix string) {
    if n.leaf { codes[n.sym] = prefix; return }
    walk(n.left, prefix+"0")
    walk(n.right, prefix+"1")
}

Why it's optimal: merging the two rarest nodes first pushes them to the deepest positions (longest codes), while frequent symbols, merged late, stay shallow (short codes). "Rarer ⇒ deeper" is achieved greedily, minimizing average code length (approaching the Shannon entropy).

Pin the properties with tests

Huffman correctness is best guarded by properties, not specific output values.

1. Prefix-free

For every pair, neither is a prefix of the other:

func TestPrefixFree(t *testing.T) {
    codes := BuildCodes(Frequencies("abracadabra"))
    for i := range list {
        for j := range list {
            if i != j && hasPrefix(list[i], list[j]) {
                t.Errorf("%q is a prefix of %q — not prefix-free", list[j], list[i])
            }
        }
    }
}

2. Frequent ⇒ shorter

A higher-frequency symbol never gets a longer code than a rarer one:

freq := map[rune]int{'a': 5, 'b': 2, 'c': 1}
codes := BuildCodes(freq)
if len(codes['a']) > len(codes['b']) { t.Error(...) }  // 'a' must be shortest

3. Round-trip

decode(encode(text)) == text for ASCII, Unicode, and single-symbol input:

for _, text := range []string{"abracadabra", "mississippi", "日本語のテキストも符号化できる"} {
    codes := BuildCodes(Frequencies(text))
    bits, _ := Encode(text, codes)
    got, _ := Decode(bits, codes)
    if got != text { t.Errorf("round trip failed: %q -> %q", text, got) }
}

Frequencies are counted per rune, so Unicode works.

4. Deterministic

Identical input always yields an identical table. Ties (equal frequency) break on a stable insertion order:

func (p pq) Less(i, j int) bool {
    if p[i].freq != p[j].freq { return p[i].freq < p[j].freq }
    return p[i].order < p[j].order  // equal freq → deterministic by insertion
}

Without this, the merge order of equal-frequency symbols would depend on Go's map iteration order and wobble run to run.

Edge case: a single distinct symbol

"aaaa" has no branching tree — the one leaf is the root. Naively its code is the empty string, so encoding produces zero bits and can't be decoded. The fix: assign code "0" when there's only one symbol.

if root.leaf {
    codes[root.sym] = "0"  // single symbol → 1-bit code
    return codes
}

func TestSingleSymbol(t *testing.T) {
    codes := BuildCodes(Frequencies("aaaa"))
    if codes['a'] != "0" { t.Error(...) }   // and "aaaa" -> "0000" -> "aaaa"
}

Reporting the ratio

Analyze compares "original bytes × 8" to the Huffman bit count:

$ huff --table "mississippi"
symbols:       4
original:      88 bits (11 bytes x 8)
huffman:       21 bits
ratio:         0.239
space saved:   76.1%

mississippi is skewed (i and s appear 4× each), so it compresses 76%. The more skewed the distribution (lower entropy), the better Huffman does.

Architecture

huffman/huffman.go — tree build (heap), code assignment, encode/decode, stats
main.go            — CLI: stats, --table, --bits, stdin

Zero external dependencies. Built with golang:1.23-alpine → alpine runtime, non-root. Passes go test -race.

Run it

go test ./... -race
go run . --table "mississippi"
docker build -t huff . && docker run --rm huff --table "abracadabra"

GitHub: https://github.com/sen-ltd/huffman-cli

Takeaways

Huffman codes are prefix-free, so the bit stream decodes uniquely with no separators.
Construction is a greedy "merge the two rarest nodes" loop — that's what minimizes average code length.
Guard correctness with properties: prefix-free, frequent ⇒ shorter, round-trip, determinism.
Break heap ties on insertion order for a deterministic table.
A single symbol needs a "0" special case — an empty code can't be decoded.
Count frequencies per rune and Unicode just works.

This is OSS portfolio #270 from SEN LLC (Tokyo). https://sen.ltd/portfolio/

DEV Community