A CLI that compresses text with Huffman coding and reports the ratio, in Go. The essence of Huffman coding: give frequent symbols short bit strings, rare ones long ones, and make the code prefix-free — no code is a prefix of another — so a concatenated bit stream decodes unambiguously with no separators. The hinge: a greedy construction that just merges the two lowest-frequency nodes repeatedly produces the optimal code tree — and pinning that with property tests.
📦 GitHub: https://github.com/sen-ltd/huffman-cli
(It's a CLI — no live demo. Run with go run or Docker.)
What "prefix-free" buys you
Fixed-width codes (8-bit ASCII) decode without separators. Go variable-length and "where does one symbol end?" becomes the problem: with a=0, b=01, the stream 001 is ambiguous.
Prefix-free removes the ambiguity. Since no code is a prefix of another, you read the bit stream left to right, walk the code tree, and the moment you hit a leaf you've finished exactly one symbol. Unique decoding, zero separators.
The hinge: merge the two rarest nodes
The tree construction is a surprisingly simple greedy algorithm:
- Make each symbol a leaf weighted by its frequency.
- Put them all in a min-heap.
- Pop the two lowest-frequency nodes, merge them under a parent (frequency = sum), push it back.
- Repeat until one node remains.
for q.Len() > 1 {
a := heap.Pop(q).(*node) // lowest
b := heap.Pop(q).(*node) // next lowest
heap.Push(q, &node{freq: a.freq + b.freq, left: a, right: b})
}
Left edge adds 0, right adds 1; each leaf's root-to-leaf path is its code:
var walk func(n *node, prefix string)
walk = func(n *node, prefix string) {
if n.leaf { codes[n.sym] = prefix; return }
walk(n.left, prefix+"0")
walk(n.right, prefix+"1")
}
Why it's optimal: merging the two rarest nodes first pushes them to the deepest positions (longest codes), while frequent symbols, merged late, stay shallow (short codes). "Rarer ⇒ deeper" is achieved greedily, minimizing average code length (approaching the Shannon entropy).
Pin the properties with tests
Huffman correctness is best guarded by properties, not specific output values.
1. Prefix-free
For every pair, neither is a prefix of the other:
func TestPrefixFree(t *testing.T) {
codes := BuildCodes(Frequencies("abracadabra"))
for i := range list {
for j := range list {
if i != j && hasPrefix(list[i], list[j]) {
t.Errorf("%q is a prefix of %q — not prefix-free", list[j], list[i])
}
}
}
}
2. Frequent ⇒ shorter
A higher-frequency symbol never gets a longer code than a rarer one:
freq := map[rune]int{'a': 5, 'b': 2, 'c': 1}
codes := BuildCodes(freq)
if len(codes['a']) > len(codes['b']) { t.Error(...) } // 'a' must be shortest
3. Round-trip
decode(encode(text)) == text for ASCII, Unicode, and single-symbol input:
for _, text := range []string{"abracadabra", "mississippi", "日本語のテキストも符号化できる"} {
codes := BuildCodes(Frequencies(text))
bits, _ := Encode(text, codes)
got, _ := Decode(bits, codes)
if got != text { t.Errorf("round trip failed: %q -> %q", text, got) }
}
Frequencies are counted per rune, so Unicode works.
4. Deterministic
Identical input always yields an identical table. Ties (equal frequency) break on a stable insertion order:
func (p pq) Less(i, j int) bool {
if p[i].freq != p[j].freq { return p[i].freq < p[j].freq }
return p[i].order < p[j].order // equal freq → deterministic by insertion
}
Without this, the merge order of equal-frequency symbols would depend on Go's map iteration order and wobble run to run.
Edge case: a single distinct symbol
"aaaa" has no branching tree — the one leaf is the root. Naively its code is the empty string, so encoding produces zero bits and can't be decoded. The fix: assign code "0" when there's only one symbol.
if root.leaf {
codes[root.sym] = "0" // single symbol → 1-bit code
return codes
}
func TestSingleSymbol(t *testing.T) {
codes := BuildCodes(Frequencies("aaaa"))
if codes['a'] != "0" { t.Error(...) } // and "aaaa" -> "0000" -> "aaaa"
}
Reporting the ratio
Analyze compares "original bytes × 8" to the Huffman bit count:
$ huff --table "mississippi"
symbols: 4
original: 88 bits (11 bytes x 8)
huffman: 21 bits
ratio: 0.239
space saved: 76.1%
mississippi is skewed (i and s appear 4× each), so it compresses 76%. The more skewed the distribution (lower entropy), the better Huffman does.
Architecture
huffman/huffman.go — tree build (heap), code assignment, encode/decode, stats
main.go — CLI: stats, --table, --bits, stdin
Zero external dependencies. Built with golang:1.23-alpine → alpine runtime, non-root. Passes go test -race.
Run it
go test ./... -race
go run . --table "mississippi"
docker build -t huff . && docker run --rm huff --table "abracadabra"
Takeaways
- Huffman codes are prefix-free, so the bit stream decodes uniquely with no separators.
- Construction is a greedy "merge the two rarest nodes" loop — that's what minimizes average code length.
- Guard correctness with properties: prefix-free, frequent ⇒ shorter, round-trip, determinism.
- Break heap ties on insertion order for a deterministic table.
- A single symbol needs a
"0"special case — an empty code can't be decoded. - Count frequencies per rune and Unicode just works.
This is OSS portfolio #270 from SEN LLC (Tokyo). https://sen.ltd/portfolio/

Top comments (0)